Recent results concerning asymptotic Bayes-optimality under sparsity (ABOS)
of multiple testing procedures are extended to fairly generally distributed
effect sizes under the alternative. An asymptotic framework is considered
where both the number of tests $m$ and the sample size $n$ go to infinity,
while the fraction $p$ of true alternatives converges to zero. It is shown
that under mild restrictions on the loss function nontrivial asymptotic
inference is possible only if $n$ increases to infinity at least at the rate
of $\log m$. Based on this assumption precise conditions are given under which
the Bonferroni correction with nominal Family Wise Error Rate (FWER) level
$\alpha$ and the Benjamini-Hochberg procedure (BH) at FDR level $\alpha$ are
asymptotically optimal. When $n \propto \log m$ then $\alpha$ can remain fixed,
whereas when $n$ increases to infinity at a quicker rate, then $\alpha$ has to
converge to zero roughly like $n^{-1/2}$. Under these conditions the
Bonferroni correction is ABOS in case of extreme sparsity ($p \propto m^{-1}$),
while BH adapts well to the unknown level of sparsity.

In the second part of this article these optimality results are carried over
to model selection in the context of multiple regression with orthogonal
regressors. Several modifications of Bayesian Information Criterion are
considered, controlling either FWER or FDR, and conditions are provided under
which these selection criteria are ABOS. Finally the performance of these
criteria is examined in a brief simulation study.

1. Introduction. Driven by a vast number of applications, over the last
few years multiple hypothesis testing with sparse alternatives has become a
topic of intensive research (see [1], [10], [16], [17], [28] or [32]). As a
result of this interest many new multiple testing procedures have been
proposed, which can be compared according to several different optimality
criteria. In the classical context a multiple testing procedure is considered
to be optimal if it maximizes the number of true discoveries, while keeping
one of the type I error measures (like Family Wise Error Rate, False Discovery
Rate or the expected number of false positives) at a certain, fixed level
(see [27], [31], [15], [34], [33], [22], [38] or [39]). A different notion of
optimality is proposed in [36] and [8], which investigate multiple testing
procedures in the context of minimizing the Bayes risk.



AMS 2000 subject classifications: Primary 62C25,62F05; secondary 62C10 
Keywords and phrases: Multiple testing, Model selection, FDR, Bayes oracle, asymptotic 
optimality, two groups model 



imsart-imsgeneric ver. 2009/08/13 file: AB0S2_arxiv.tex date: July 13, 2011 
In many applications of high-dimensional multiple testing it is assumed that
the proportion $p$ of true alternative hypotheses among all tests is very small.
In asymptotic analysis this is often expressed by the sparsity assumption, that
$p$ decreases to 0 as the total number of tests $m$ increases to infinity.
Recently, substantial efforts have been made to understand the asymptotic
properties of multiple testing under sparsity (see [16], [17], [1], [8]).

Bogdan et al. [8] consider the problem of testing hypotheses about means
$\mu_i$ in normal populations $X_i \sim N(\mu_i, \sigma^2)$, $i = 1, \ldots, m$.
Their analysis is based on a two-groups model, which assumes that the unknown
means are generated by the scale mixture of two normal distributions: null and
alternative. The classical case of testing $H_{0i}: \mu_i = 0$ corresponds to
the situation when the variance of the null distribution is equal to 0. In [8]
the ratio $u$ of variances of the alternative distribution of $\mu_i$ and the
null distribution of $X_i$ slowly increases to infinity as $p \to 0$, at a rate
which guarantees that the limiting power of the Bayes classifier is larger than
0 and smaller than 1. Such sequences of alternative distributions are
considered to be "on the verge of detectability". The Bayes risk is computed
assuming that losses generated by the type I and type II errors are the same
for all tests, and the total loss is the sum of losses for individual tests. In
case of known $p$, $\sigma^2$ and $u$ the risk is minimized by using Bayes
classifiers for each individual test. This optimal rule, which is in practice
unattainable, is referred to as the Bayes oracle.

Under the described asymptotic assumptions a multiple testing rule is classi- 
fied as asymptotically Bayes optimal under sparsity (ABOS) if the ratio of the 
corresponding Bayes risk and the risk of the Bayes oracle converges to one. Bog- 
dan et al. [8] characterize the class of multiple testing rules with fixed threshold 
which are ABOS, and they provide conditions under which the Bonferroni cor- 
rection and the popular Benjamini-Hochberg multiple testing procedure (BH, 
[3]) are asymptotically optimal. 

In the first part of this paper we extend the results of [8] concerned with
testing $H_{0i}: \mu_i = 0$ to the case when the distribution $\nu(\mu)$ of
$\mu_i$ under the alternative is fixed and not necessarily normal, while the
number of individuals $n$ used to calculate the test statistics
$\bar X_i = \frac{1}{n}\sum_{j=1}^{n} X_{ji}$ increases with $m$. It turns out
that, given $p \propto m^{-\beta}$, signals are at the verge of detectability
exactly when $n \propto \log m$. This situation is notably relevant in the
context of bioinformatics data, where $n$ is usually much smaller than $m$. We
show that in this case BH and the Bonferroni correction are ABOS under the same
assumptions as in [8]. In particular, we show that if $\nu(\mu)$ has a positive
and bounded density on the real line then the Bonferroni correction at a fixed
FWER $\alpha \in (0, 1)$ is ABOS if $p \propto m^{-1}$ and the ratio $\delta$ of
losses for the false positive and the false negative decreases to 0 at such a
rate that $\log \delta = o(\log m)$. In contrast BH at a fixed FDR level
$\alpha \in (0, 1)$ adapts very well to the unknown level of sparsity and is
ABOS whenever $p \propto m^{-\beta}$, $\beta \in (0, 1]$. As explained in [8]
the assumption of decreasing $\delta$ is quite reasonable since the cost of
missing a true signal usually increases when the total number of signals
decreases. We also show that if $p \propto m^{-\beta}$ with $\beta \in (0, 1]$
then the step-down version of the FDR controlling procedure, SD, is ABOS under
the same conditions as BH.

Unlike in [8] we also consider the case where the power of the Bayes oracle
converges to 1. For $p \propto m^{-\beta}$ this relates to the case where $n$
increases to infinity at a quicker rate than $\log m$. We show that in this
case BH and SD are ABOS for any $\beta \in (0, 1]$ as long as FDR levels
decrease to 0 approximately at the rate of $n^{-1/2}$, while $\delta$ is
bounded from above and such that $\log \delta = o(\log m)$. Similarly, the
Bonferroni correction is ABOS if its FWER converges to zero at the rate
$n^{-1/2}$ and $p \propto \frac{1}{m}$. In this case the only assumption on
$\nu$ is that it has a positive and bounded density in a neighborhood of 0.
Extending the results of [8] to a more general class of distributions is based
on techniques introduced by [29], where nontrivial modifications are required
to deal with sparsity.

In the second part of the paper we use the results on multiple testing rules 
to prove asymptotic optimality of some model selection criteria for sparse least 
squares regression. Here we concentrate on the orthogonal design and study the 
two cases of known and unknown variance of the error term $\sigma^2$. As
discussed in [6], in case of orthogonal design with known $\sigma$, penalized
likelihood model selection criteria work analogously to multiple testing
procedures which verify individually the significance of each regression
coefficient. Based on this analogy it is very easy to prove that popular model
selection criteria, like AIC [2] or BIC [35], are not consistent when $m$
increases to infinity (see [6]). Specifically, under this scenario the expected
number of false discoveries increases to infinity.

To solve this problem some modifications of AIC [12] and BIC (see, [5, 13]) 
were recently proposed in the literature. In this article we will concentrate on 
modifications of BIC, which is more appropriate to consider when one aims at 
minimizing the misclassification rate, or in our context the Bayes risk based on 
a generalized 0-1 loss. The first of the considered criteria, mBIC, was derived in 
[5] in a Bayesian setting using a prior on the model dimension which assumes 
that the expected number of true regressors does not depend on m. In case of 
orthogonality and known $\sigma$ it was pointed out in [6] that mBIC is controlling
the FWER. Optimality results at a sparsity level $p \propto m^{-1}$ follow immediately
from the analysis for multiple testing. 

In view of results on multiple testing it would actually be of great interest to 
study model selection criteria which control the FDR. In [1] penalized model 
selection schemes are discussed which have exactly this property. Quite sim- 
ilar penalties have been discussed in [23] and [26]. Starting from the penalty 
of [1] we will introduce several new modifications of BIC (mBIC1 - mBIC3),
where mBIC2 has been shown already to perform very well in the application
of genome wide association studies [24]. In case of known $\sigma$ we prove
that the FDR controlling criteria are ABOS for a wide range of sparsity levels,
satisfying for example $p = m^{-\beta}$, with $\beta \in (0, 1]$.

In most applications it is much more realistic to assume that $\sigma$ is not
known. Under sparsity it is rather difficult to get reliable estimates of
$\sigma$, and for that reason optimality results on the corresponding model
selection criteria under sparsity are very rare in the literature. In a
Bayesian approach with normally distributed error terms, $\sigma$ is integrated
out and in the corresponding version of BIC the residual sum of squares RSS is
replaced by log RSS. We will show that in this context mBIC is again ABOS in
case of extreme sparsity. The conditions we need for unknown $\sigma$ are not
much more restrictive than for known $\sigma$. Our proof is technically rather
involved, and cannot be easily extended to prove ABOS for mBIC1 - mBIC3.
However, in analogy to the case of known variance we conjecture that these
criteria should be ABOS for a wide range of sparsity levels. This conjecture is
underpinned by simulations, which show good properties of the new versions of
mBIC both for known and unknown $\sigma$.

The rest of the paper is organized as follows. In Section 2 we present results 
for multiple testing, whereas Section 3 focuses on linear regression models under 
orthogonality. The main emphasis of Sections 2.1 and 2.2 is the generalization 
of results from [8] to the situation of general distributions under the alternative. 
Section 2.3 shows ABOS of Bonferroni correction in case of extreme sparsity. 
The most important theorems on multiple testing are given in Section 2.4, 
where ABOS of step-up and step-down FDR controlling procedures is proven. 
These results are needed in Section 3.2 to show ABOS of the FDR-controlling 
model selection criteria, after ABOS of mBIC for known variance was shown 
in Section 3.1. Optimality results of mBIC for unknown variance are proved in 
Section 3.3. Finally in Section 4 different model selection criteria are compared 
in a small simulation study. Most proofs of technical results can be found in the 
Appendix. 

2. ABOS for multiple testing rules. Consider a set of $m$ normal populations
$N(\mu_i, \sigma^2)$, $i = 1, \ldots, m$. We are interested in testing point
null hypotheses $H_{0i}: \mu_i = 0$ against the alternatives
$H_{Ai}: \mu_i \neq 0$, based on simple random samples
$X_i = (X_{1i}, \ldots, X_{ni})$ of size $n$ from each of these populations.
The effects under study $\mu_i$ are supposed to be independent and identically
distributed according to a mixture distribution

(2.1)   $\nu_{mix} = (1 - p)\, d_0 + p\, \nu$,

where $d_0$ is the Dirac measure at 0, $\nu$ is a probability measure on the
real line describing the distribution of $\mu_i$ under the alternative, and
$p \in (0, 1)$ is the proportion of alternatives among all tests. Since $\nu$
describes the alternative distribution of the different $\mu_i$, we assume that
$\nu(\{0\}) = 0$. Furthermore both positive and negative values of $\mu_i$
should be possible, that is

(2.2)   $\nu(-\infty, 0) > 0$ and $\nu(0, \infty) > 0$.

From (2.1) it easily follows that the marginal distribution of the sample mean
$\bar X_i = \frac{1}{n}\sum_{j=1}^{n} X_{ji}$ is the mixture

(2.3)   $\bar X_i \sim (1 - p)\, N(0, \sigma^2/n) + p\, (\nu * N(0, \sigma^2/n))$,

where the pdf of the second measure is computed by convolution of $\nu$ and
$N(0, \sigma^2/n)$.

Our decision theoretic framework for multiple testing is based on a
generalization of the standard 0-1 loss. There are $m$ decisions to be made.
For each false rejection (type I error) we assign a loss of $\delta_0$, and for
missing a true signal (type II error) a loss of $\delta_A$. The total loss of a
multiple testing procedure is then defined as the sum of losses for individual
tests [30]. The total loss is clearly minimized by applying the Bayes
classifier to each individual test, the decision rule which was called Bayes
oracle in [8].

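Hence our first task is to determine the critical values $a_n$ and $b_n$
corresponding to the Bayes classifier for each individual test. As noted in
[29], if $p \in (0, 1)$ then for any measure $\nu$ satisfying (2.2) and
sufficiently large $n$, the Bayes classifier chooses $H_{0i}$ if
$\bar X_i \in (a_n, b_n)$, where the critical values $a_n$ and $b_n$ are
uniquely defined by

(2.4)   $a_n < 0 < b_n$,

        $(1 - p)\,\delta_0 = p\,\delta_A \int_{\mathbb{R}} \exp\left( n \left( \frac{a_n \mu}{\sigma^2} - \frac{\mu^2}{2\sigma^2} \right) \right) d\nu(\mu)$,

        $(1 - p)\,\delta_0 = p\,\delta_A \int_{\mathbb{R}} \exp\left( n \left( \frac{b_n \mu}{\sigma^2} - \frac{\mu^2}{2\sigma^2} \right) \right) d\nu(\mu)$.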
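As a hedged numerical illustration of (2.4), the following Python sketch solves
the second equation for $b_n$ by bisection when $\nu$ is a discrete measure.
The function names, the choice of a symmetric two-point $\nu$ and all parameter
values are assumptions made for illustration only; they do not come from the
paper. Dividing (2.4) by $p\,\delta_A$ gives $\delta f = \int \exp(\cdot)\,d\nu$
with $\delta f = (1-p)\delta_0/(p\,\delta_A)$, which is the form used below.

```python
import math

def oracle_rhs(x, n, sigma, atoms):
    """Integral in (2.4) for a discrete nu given as (mu, weight) pairs."""
    return sum(w * math.exp(n * (x * mu / sigma**2 - mu**2 / (2 * sigma**2)))
               for mu, w in atoms)

def upper_critical_value(delta_f, n, sigma, atoms, hi=1.0, tol=1e-12):
    """Solve delta*f = integral for b_n > 0 by bisection.

    The integral is convex in x, so once it lies below delta*f at x = 0
    there is exactly one crossing on the positive half line.
    """
    lo = 0.0
    if oracle_rhs(lo, n, sigma, atoms) >= delta_f:
        raise ValueError("no positive root: delta*f is too small for this n")
    while oracle_rhs(hi, n, sigma, atoms) < delta_f:
        hi *= 2.0  # enlarge the bracket until it contains the root
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if oracle_rhs(mid, n, sigma, atoms) < delta_f:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Hypothetical example: nu puts mass 1/2 on each of -0.5 and +0.5.
atoms = [(-0.5, 0.5), (0.5, 0.5)]
b_n = upper_critical_value(delta_f=99.0, n=20, sigma=1.0, atoms=atoms)
```

For this symmetric two-point measure the integral reduces to
$e^{-n\mu_0^2/(2\sigma^2)} \cosh(n x \mu_0/\sigma^2)$, so the root can be
checked against an inverse hyperbolic cosine, and $a_n = -b_n$ by symmetry.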
Let $\delta = \delta_0/\delta_A$ denote the ratio of type I error and type II
error losses, and let $f = (1 - p)/p$, which serves as a measure of sparsity.
In the forthcoming asymptotic analysis we will assume that $m \to \infty$ and
that $n = n_m \to \infty$. Furthermore we will allow the parameters
$\delta = \delta_m$ and $p = p_m$ to depend on $m$, whereas $\sigma$ and $\nu$
are kept fixed. For simplicity of notation the index $m$ will be omitted for
$n$, $\delta$ and $p$. The most generic situation will be $p \to 0$, in which
case $f \to \infty$. However, theorems are formulated in the more general
setting under the following assumption:

Assumption (A): $n \to \infty$, $\delta f \to c \in (0, \infty]$, and
$\frac{2\log(\delta f)}{n} \to C$, where $0 \le C < \infty$.

Remark 2.1. Under the model assumptions of [8] "signals on the verge of
detectability" had to satisfy $\frac{2\log(\delta f)}{u} \to C \in (0, \infty)$,
which yielded asymptotic power of the Bayes oracle within (0, 1). Here we are
concerned with a different situation, where the alternative distribution for
$\mu_i$ is not necessarily normal and does not depend on $p$, but the number
$n$ of individuals increases to infinity. In this setting the role of $u$ is
taken by $n$. Compared to the assumptions in [8] the major difference is that
we additionally consider the case $\frac{2\log(\delta f)}{n} \to C = 0$, which
means that the asymptotic power of the Bayes oracle is equal to 1. This
additional case covers the interesting scenario where sparsity is of the form
$p = m^{-\beta}$, $\beta > 0$, $\log \delta = o(\log m)$ and
$n \in (m^{c_1}, m^{c_2})$, for any positive constants $c_1 < c_2$.

The generic situation will be concerned with sparsity and with the loss ratio
$\delta$ having no dominating influence on the asymptotic results. We formalize
this in

Assumption (B): $n \to \infty$, $p \to 0$, $\log \delta = o(\log p)$ and
$\delta$ bounded from above.

If Assumption (B) holds, then $-\frac{2\log p}{n} \to C \ge 0$ is enough to
guarantee that Assumption (A) is fulfilled. All theorems in Section 3 are
formulated under Assumption (B).
The following assumption imposes a restriction on the measure $\nu$, which will
be used throughout this manuscript.

Assumption (C): Let $T := \sigma\sqrt{C}$. We assume that there exists
$\epsilon > 0$ such that $\nu$ has a positive bounded density $\rho$ with
respect to Lebesgue measure on $[-T - \epsilon, -T + \epsilon]$ and
$[T - \epsilon, T + \epsilon]$. In case of $C = 0$ it is further assumed that
$\rho(0^-) := \lim_{\mu \to 0^-} \rho(\mu)$ and
$\rho(0^+) := \lim_{\mu \to 0^+} \rho(\mu)$ both exist and are finite and
positive.

The following Lemma provides the asymptotic critical points of the Bayes
rule for distributions $\nu$ satisfying Assumption (C).

Lemma 2.1. Let Assumptions (A) and (C) hold. Then the critical values
converge with limits

$a_n \to -T$ and $b_n \to T$.

The proof is given in Appendix 6.1.

Notation: Throughout the paper we will make use of the following notation:
Let $g_n$ and $h_n$ be two sequences. Then $g_n \sim h_n$ indicates that
$g_n/h_n \to 1$ as $n \to \infty$. If $g_n \to 0$ we write $g_n = o_n$.

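The following Lemma 2.2 specifies the rate at which $a_n$ and $b_n$ converge to
zero in case of $C = 0$.

Lemma 2.2. Let Assumptions (A) and (C) hold. If $C = 0$ then the critical
values of the Bayes oracle fulfill

(2.5)   $\sqrt{n}\, e^{-\frac{n a_n^2}{2\sigma^2}} \sim \frac{\sqrt{2\pi}\,\sigma}{f\delta}\, \rho(0^-)$

and

(2.6)   $\sqrt{n}\, e^{-\frac{n b_n^2}{2\sigma^2}} \sim \frac{\sqrt{2\pi}\,\sigma}{f\delta}\, \rho(0^+)$.

The proof is given in Appendix 6.2.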



Remark 2.2. As shown in the proof of Lemma 2.2, the accuracy of the
approximations provided in (2.5) and (2.6) depends on the asymptotic behavior
of $\delta f$ and on the regularity of $\rho$ in a neighborhood of 0. Assuming
for example that $\rho$ is one-sided Lipschitz (on both sides of 0) and that
$\delta f$ is polynomially bounded one obtains that the ratio of the right and
left-hand sides of (2.5) and (2.6) can be expressed as $1 + z_n$ with
$z_n = o(n^{-1/2} \log n)$.

Remark 2.3. The results of Lemmas 2.1 and 2.2 generalize the critical
value of the Bayes rule specified in [7] and [8]. Note that for
$\nu \sim N(0, \tau^2)$ the "magnitude" of the true signal defined in [8] is
given by $u = \frac{n\tau^2}{\sigma^2}$. Thus, according to Lemma 2.1, for
$C > 0$ the Bayes classifier rejects the null hypothesis if

$\frac{n \bar X_i^2}{\sigma^2} \ge \log(u f^2 \delta^2)\,(1 + o_n)$,

which agrees with the results of [8].

Next consider the case $C = 0$. For the normal distribution
$\mu_i \sim N(0, \tau^2)$ it holds that
$\rho(0^-) = \frac{1}{\sqrt{2\pi}\,\tau}$. Taking logarithms of (2.5) we obtain
the accurate approximation

$\frac{n a_n^2}{\sigma^2} = 2 \log\left( \frac{\sqrt{n}\, f \delta}{\sqrt{2\pi}\, \sigma\, \rho(0^-)} \right) + o_n = \log(u f^2 \delta^2) + o_n$,

and because of $\rho(0^-) = \rho(0^+)$ the same relation holds for $b_n$.
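The approximation for normal $\nu$ can be checked numerically. The sketch below
assumes $\nu = N(0, \tau^2)$, for which completing the square in (2.4) gives the
closed form $n b_n^2/\sigma^2 = (1 + 1/u)\log(\delta^2 f^2 (1 + u))$ with
$u = n\tau^2/\sigma^2$. This closed form is a derivation made for this
illustration, not a formula quoted from the paper, and the function names and
parameter values are hypothetical.

```python
import math

def exact_threshold_sq(n, sigma, tau, delta_f):
    """Exact n*b_n^2/sigma^2 from (2.4) when nu = N(0, tau^2) (assumed closed form)."""
    u = n * tau**2 / sigma**2
    return (1.0 + 1.0 / u) * math.log(delta_f**2 * (1.0 + u))

def asymptotic_threshold_sq(n, sigma, tau, delta_f):
    """Leading term log(u f^2 delta^2) from Remark 2.3."""
    u = n * tau**2 / sigma**2
    return math.log(u * delta_f**2)

def gap(n, sigma=1.0, tau=1.0, delta_f=1000.0):
    """Difference between the exact and the asymptotic squared threshold."""
    return (exact_threshold_sq(n, sigma, tau, delta_f)
            - asymptotic_threshold_sq(n, sigma, tau, delta_f))

gap_small_n = gap(100)
gap_large_n = gap(10**6)
```

With $\delta f$ held fixed the gap equals
$\log(1 + 1/u) + \frac{1}{u}\log(\delta^2 f^2 (1 + u))$, which is positive and
vanishes as $n$ grows, in line with the $o_n$ term above.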

To emphasize similarity with the results for normal scale mixture models
from [8] we introduce the notation

$v := n \delta^2 f^2$.

Then according to Lemmas 2.1 and 2.2 the Bayes oracle threshold values
satisfy

(2.7)   $a_n = -\sigma \sqrt{\frac{\log v}{n}}\,(1 + o_n)$ and $b_n = \sigma \sqrt{\frac{\log v}{n}}\,(1 + o_n)$.

The risk for a multiple testing rule is computed under the additive loss of
individual tests simply as the sum of the risks of individual tests. Note that
for the specified mixture model (2.3) type I error $t_1$ and type II error
$t_2$ of fixed threshold rules are identical for each individual test. The
corresponding risk is therefore defined as

(2.8)   $R = R_1 + R_2 = m (1 - p)\, t_1 \delta_0 + m p\, t_2 \delta_A$.

In the following theorem we compute the asymptotic risk $R_{opt}$ of the Bayes
oracle.

Theorem 2.1. Under Assumptions (A) and (C) the risk obtained by the
Bayes rule (2.4) takes for $C = 0$ the form

(2.9)   $R_{opt} = m p\, \delta_A\, \sigma \sqrt{\frac{\log v}{n}}\, (\rho(0^-) + \rho(0^+))\,(1 + o_n)$,

whereas for $0 < C < \infty$

(2.10)   $R_{opt} = m p\, \delta_A\, \nu(-T, T)\,(1 + o_n)$.

The proof is given in Appendix 6.3.



Definition: A multiple testing rule is called asymptotically Bayes optimal
under sparsity (ABOS) if its risk $R$ satisfies $\frac{R}{R_{opt}} \to 1$ under
the conditions of Assumption (A).
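To make the ratio $R/R_{opt}$ concrete, the following sketch evaluates the risk
(2.8) per test for a symmetric rule that rejects when $|Z_i| \ge c$, under the
simplifying assumption $\nu = N(0, \tau^2)$, so that the scaled statistic has
variance $1 + u$ under the alternative with $u = n\tau^2/\sigma^2$. The oracle
threshold uses the closed form that follows from (2.4) for this Gaussian case.
All names and parameter values are illustrative assumptions.

```python
import math
from statistics import NormalDist

PHI = NormalDist().cdf  # standard normal cdf

def risk_per_test(c, p, delta0, deltaA, u):
    """Risk (2.8) divided by m for the rule |Z_i| >= c, assuming nu = N(0, tau^2)."""
    t1 = 2.0 * (1.0 - PHI(c))                    # type I error
    t2 = 2.0 * PHI(c / math.sqrt(1.0 + u)) - 1.0  # type II error
    return (1.0 - p) * t1 * delta0 + p * t2 * deltaA

def oracle_threshold(p, delta0, deltaA, u):
    """Bayes oracle threshold for |Z_i| in the Gaussian case (assumed closed form)."""
    f = (1.0 - p) / p
    delta = delta0 / deltaA
    return math.sqrt((1.0 + 1.0 / u) * math.log(delta**2 * f**2 * (1.0 + u)))

def risk_ratio(c, p, delta0, deltaA, u):
    """R / R_opt for a fixed threshold c."""
    c_opt = oracle_threshold(p, delta0, deltaA, u)
    return risk_per_test(c, p, delta0, deltaA, u) / risk_per_test(c_opt, p, delta0, deltaA, u)

# Hypothetical setting: 1% true signals, equal losses, u = 100.
ratios = [risk_ratio(0.5 + 0.1 * k, 0.01, 1.0, 1.0, 100.0) for k in range(56)]
```

Because the oracle minimizes the Bayes risk, every ratio in the grid is at
least 1, and values close to 1 identify thresholds that are nearly optimal for
this particular parameter setting.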
2.1. ABOS of fixed threshold rules. The next theorem describes which
multiple testing rules with fixed threshold are ABOS.

Theorem 2.2. Consider the testing rule which rejects $H_{0i}$ if $\bar X_i$
falls outside the interval $(a_n, b_n)$, with $a_n < 0$ and $b_n > 0$. Under
Assumptions (A) and (C) this rule is ABOS if and only if

(2.11)   $\frac{n a_n^2}{\sigma^2} = \log v + z_a$ and $\frac{n b_n^2}{\sigma^2} = \log v + z_b$,

where

(2.12)   $z_a = o(\log v)$, $z_b = o(\log v)$,

and

(2.13)   $\lim_{n \to \infty} (z_a + 2 \log\log v) = \infty$, $\lim_{n \to \infty} (z_b + 2 \log\log v) = \infty$.

The proof is given in Appendix 6.4.

As a simple consequence of Theorem 2.2 we have

Corollary 2.1. Suppose that in addition to the assumptions of Theorem
2.2 also Assumption (B) holds. If for $n = n_m$ the sparsity assumption

(2.14)   $mp \to s \in (0, \infty]$, $\frac{\log(mp)}{\log(n/p^2)} \to 0$

is fulfilled, then thresholds of the form

(2.15)   $\frac{n a_n^2}{\sigma^2} = \frac{n b_n^2}{\sigma^2} = \log(n m^2) + \xi$, $\xi = o(\log(n/p^2))$,

yield multiple testing rules which are ABOS, whenever
$\xi > -2\log(mp) + d$ for some arbitrary constant $d$. In particular this is
the case when $\xi$ is a constant.

Proof. Simply observe that
$z = \log(n m^2) + \xi - \log(n p^{-2} \delta^2)$ fulfills the requirements of
Theorem 2.2 under the assumption of the corollary. □



Remark 2.4. Corollary 2.1 addresses the situation of extreme sparsity,
where the number $m$ of tests increases to infinity, but the expected number
of true signals remains constant or increases only very slowly with $m$. If
additionally $\log n = o(\log m)$ then Corollary 2.1 implies that the universal
threshold $2 \log m$ of [18] is ABOS. This extends Remark 3.4 of [8] to the
case where the distribution of $\mu_i$ under the alternative is not necessarily
normal and does not change with $m$, while the number of individuals $n$ slowly
increases with $m$.
2.2. BFDR controlling procedures. One of our main goals is to study ABOS
of FDR controlling procedures like the popular Benjamini-Hochberg procedure
(BH, [3]). As in [8] the main technical tool to prove ABOS is to approximate
the random threshold of BH by the threshold from a rule controlling the
Bayesian false discovery rate (BFDR, see [19]). For that reason we will start
our discussion here with results on the asymptotic properties of BFDR rules for
general distributions of $\mu_i$ under the alternative. BFDR is defined as

(2.16)   $BFDR = P(H_{0i} \text{ is true} \mid H_{0i} \text{ was rejected}) = \frac{(1 - p)\, t_{1i}}{(1 - p)\, t_{1i} + p\,(1 - t_{2i})}$,

where $t_{1i}$ and $t_{2i}$ are the probabilities of the corresponding type I
and type II errors. Consider a fixed threshold rule based on $\bar X_i$ with
threshold values $a < 0$ and $b > 0$. Then $t_{1i} = t_1$, $t_{2i} = t_2$, and
under the mixture model (2.3)

$t_1 = \Phi(\sqrt{n}\, a/\sigma) + 1 - \Phi(\sqrt{n}\, b/\sigma)$

and

$t_2 = 1 - \int_{\mathbb{R}} \left( \Phi(\sqrt{n}(a - \mu)/\sigma) + 1 - \Phi(\sqrt{n}(b - \mu)/\sigma) \right) d\nu(\mu)$.

To obtain threshold values $a_n^B < 0$ and $b_n^B > 0$ with BFDR level $\alpha$
we have to solve $\frac{(1 - p)\, t_1}{(1 - p)\, t_1 + p\,(1 - t_2)} = \alpha$,
or equivalently

$\frac{\alpha}{f (1 - \alpha)} = \frac{\Phi(\sqrt{n}\, a_n^B/\sigma) + 1 - \Phi(\sqrt{n}\, b_n^B/\sigma)}{\int_{\mathbb{R}} \left( \Phi(\sqrt{n}(a_n^B - \mu)/\sigma) + 1 - \Phi(\sqrt{n}(b_n^B - \mu)/\sigma) \right) d\nu(\mu)}$.

We will restrict our attention to rules based on symmetric thresholds, such
that $a_n^B = -b_n^B$, and use

(2.17)   $c_B^2 = c_B^2(n) := \frac{n (a_n^B)^2}{\sigma^2} = \frac{n (b_n^B)^2}{\sigma^2}$

to denote the corresponding threshold for the squared scaled test statistics
$Z_i^2 = \frac{n \bar X_i^2}{\sigma^2}$. Then $c_B$ satisfies the following
equation

(2.18)   $\frac{\alpha}{f (1 - \alpha)} = \frac{2 (1 - \Phi(c_B))}{2 - \int_{\mathbb{R}} \left( \Phi(c_B + \sqrt{n}\,\mu/\sigma) + \Phi(c_B - \sqrt{n}\,\mu/\sigma) \right) d\nu(\mu)}$.

As shown in Lemma 6.4 in Appendix 6.5, $\alpha \in (0, 1 - p)$ guarantees
existence and uniqueness of a solution $c_B$ for (2.18). The following theorem
provides conditions on $\alpha$, for which the BFDR controlling rule specified
in (2.18) is ABOS.
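Equation (2.18) has no closed-form solution in general, but it is easy to solve
numerically. The sketch below does so by bisection under the assumption
$\nu = N(0, \tau^2)$, in which case the denominator of (2.18) equals
$2(1 - \Phi(c_B/\sqrt{1 + u}))$ with $u = n\tau^2/\sigma^2$. The Gaussian choice
of $\nu$, the function names and the parameter values are assumptions made for
illustration.

```python
import math

def upper_tail(x):
    """1 - Phi(x), computed with erfc for accuracy in the far tail."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def bfdr(c, p, u):
    """BFDR (2.16) of the rule |Z_i| >= c when nu = N(0, tau^2)."""
    t1 = 2.0 * upper_tail(c)                         # type I error
    power = 2.0 * upper_tail(c / math.sqrt(1.0 + u))  # 1 - t2
    return (1.0 - p) * t1 / ((1.0 - p) * t1 + p * power)

def bfdr_threshold(alpha, p, u, hi=30.0, tol=1e-12):
    """Solve BFDR(c) = alpha for c; BFDR falls from 1 - p at c = 0 towards 0."""
    if not 0.0 < alpha < 1.0 - p:
        raise ValueError("alpha must lie in (0, 1 - p)")
    lo = 0.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if bfdr(mid, p, u) > alpha:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Hypothetical setting: alpha = 0.05, p = 0.01, u = 100.
c_B = bfdr_threshold(0.05, 0.01, 100.0)
```

The requirement $\alpha \in (0, 1 - p)$ appears in the code as the input check:
at $c = 0$ every hypothesis is rejected and the BFDR equals $1 - p$, so larger
levels cannot be attained by any threshold.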

Theorem 2.3. In addition to Assumptions (A) and (C) suppose that
$\alpha \in (0, 1 - p)$, $\alpha \to \alpha_\infty < 1$, and

(2.19)   $f/\alpha \to \infty$, $\frac{\log(f/\alpha)}{n} \to C_0 < \infty$,

where $C_0$ is such that $\nu(-\sigma\sqrt{2C_0}, \sigma\sqrt{2C_0}) < 1$ and
$\nu$ has no atoms at $\pm\sigma\sqrt{2C_0}$. The threshold value $c_B$ of the
rule controlling BFDR at level $\alpha$ is then given by

(2.20)   $c_B^2 = 2 \log(f/\alpha) - \log(2 \log(f/\alpha)) + 2 \log\left( \frac{(1 - \alpha_\infty)\sqrt{2/\pi}}{C_1} \right) + o_n$,

where

$C_1 = 1 - \nu(-\sigma\sqrt{2C_0}, \sigma\sqrt{2C_0})$.

The BFDR controlling rule is ABOS if and only if

(2.21)   $\frac{\log(f \delta \sqrt{n})}{\log(f/\alpha)} \to 1$ and $2 \log(\alpha \delta \sqrt{n}) - \log\log(f/\alpha) \to -\infty$.

In that case $C_0 = C/2$ and therefore $C_1 = 1 - \nu(-T, T)$.
The proof is given in Appendix 6.6.



Corollary 2.2. If in addition to the assumptions of Theorem 2.3 also
Assumption (B) holds then the fixed threshold rule with BFDR at the level
$\alpha \propto n^{-1/2}$ is ABOS.

Corollary 2.3. If in addition to the assumptions of Theorem 2.3 and
Assumption (B) also $\delta \to 0$ and $n \propto -\log p$ then the fixed
threshold rule with BFDR equal to $\alpha \in (0, 1)$ is ABOS. It is not
possible that a BFDR controlling rule is ABOS when both $\alpha$ and $\delta$
are constant.

Remark 2.5. Based on (2.20) straightforward calculations yield the
asymptotic type I error of the BFDR rule

(2.22)   $t_1^B = \frac{C_1\, \alpha}{(1 - \alpha_\infty)\, f}\,(1 + o_n)$.

The BFDR controlling rules discussed in this section require the knowledge 
of some of the parameters of the unknown mixture distribution and therefore 
they are not applicable in practice. However, the results on ABOS of the BFDR 
controlling rules can be used to prove ABOS of some popularly used multiple 
testing rules, like the Bonferroni correction or the Benjamini-Hochberg pro- 
cedure. Asymptotic optimality results of these rules will be presented in the 
following sections. 

2.3. Bonferroni correction. In applied sciences the most popular multiple
testing procedure is still the fixed threshold rule of Bonferroni correction.
In our setting its critical value $c_{Bon}$ for the test statistic
$\frac{\sqrt{n}\,|\bar X_i|}{\sigma}$ is defined by

$1 - \Phi(c_{Bon}) = \frac{\alpha}{2m}$.

The procedure controls the family wise error rate at level $\alpha$. The
following lemma specifies the conditions for $\alpha$ under which the
Bonferroni procedure is ABOS.
Lemma 2.3. Suppose Assumptions (A), (C) and sparsity condition (2.14)
hold. The Bonferroni procedure at FWER level $\alpha_n$ is ABOS if $\alpha_n$
satisfies the assumptions of Theorem 2.3.

Proof. If $m \to \infty$ then the threshold for the Bonferroni correction can
be written as

$c_{Bon}^2 = 2 \log\left( \frac{m}{\alpha} \right) - \log\left( 2 \log\left( \frac{m}{\alpha} \right) \right) + \log(2/\pi) + o_n$.

Comparison of this threshold with the asymptotic approximation to an optimal
BFDR control rule (2.17) and (2.20) yields

$c_{Bon}^2 = c_B^2 + 2 \log(mp) + O_n(1)$.

From (2.14) it follows easily that $c_{Bon}^2 = c_B^2 (1 + o_n)$. By
assumption, the rule based on the threshold $c_B^2$ is optimal, and hence
$c_{Bon}^2$ satisfies condition (2.12) of Theorem 2.2. Condition (2.13) is
satisfied, since by assumption $\log(mp)$ is bounded from below, and thus ABOS
of the Bonferroni correction follows. □
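The expansion of the Bonferroni threshold used in the proof can be compared
with the exact quantile. The sketch below relies only on the Python standard
library (`statistics.NormalDist.inv_cdf`); the function names and the chosen
values of $m$ and $\alpha$ are illustrative assumptions.

```python
import math
from statistics import NormalDist

def bonferroni_threshold(m, alpha):
    """Exact c_Bon solving 1 - Phi(c_Bon) = alpha / (2m)."""
    # By symmetry, the upper quantile is minus the lower quantile.
    return -NormalDist().inv_cdf(alpha / (2.0 * m))

def bonferroni_expansion_sq(m, alpha):
    """Asymptotic form 2 log(m/alpha) - log(2 log(m/alpha)) + log(2/pi)."""
    lead = 2.0 * math.log(m / alpha)
    return lead - math.log(lead) + math.log(2.0 / math.pi)

def expansion_error(m, alpha=0.05):
    """Exact squared threshold minus its asymptotic expansion."""
    return bonferroni_threshold(m, alpha) ** 2 - bonferroni_expansion_sq(m, alpha)

err_thousand = expansion_error(10**3)
err_million = expansion_error(10**6)
```

For these values the error is small compared with the squared threshold itself
(roughly 16 for $m = 10^3$ and 30 for $m = 10^6$), which is all the proof
requires of the $o_n$ term.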

2.4. FDR controlling procedures. The Benjamini-Hochberg rule [3], which
we will also call step-up FDR controlling procedure, is defined as follows: For
the square of the scaled test statistics $Z_i^2 = \frac{n \bar X_i^2}{\sigma^2}$
one computes two-sided p-values $p_i = 2(1 - \Phi(|Z_i|))$, which are then
ordered $p_{[1]} \le p_{[2]} \le \cdots \le p_{[m]}$. For the step-up procedure
at the FDR level $\alpha$ compute

(2.23)   $k_F := \max\left\{ i : p_{[i]} \le \frac{i \alpha}{m} \right\}$

and reject the $k_F$ hypotheses with p-values smaller than or equal to
$p_{[k_F]}$. In view of the proof of ABOS for FDR controlling model selection
criteria in Section 3.2 we will not only consider the step-up procedure, but
also the corresponding step-down procedure at level $\alpha$. For this compute

(2.24)   $k_G := \min\left\{ i : p_{[i]} > \frac{i \alpha}{m} \right\}$

and reject the $k_G - 1$ hypotheses with p-values smaller than $p_{[k_G]}$. It
is well known that in practice both procedures behave very similarly (see [1]).
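The two definitions (2.23) and (2.24) translate directly into code. The
following sketch returns the number of rejections of each procedure; the
function names and the example p-values are hypothetical, and the conventions
for the degenerate cases (no index satisfies the step-up condition, or no index
violates it in the step-down scan) are assumptions of this illustration.

```python
def step_up_rejections(pvalues, alpha):
    """k_F from (2.23): largest i with p_[i] <= i*alpha/m (0 if there is none)."""
    m = len(pvalues)
    ordered = sorted(pvalues)
    k = 0
    for i, p in enumerate(ordered, start=1):
        if p <= i * alpha / m:
            k = i
    return k

def step_down_rejections(pvalues, alpha):
    """k_G - 1 from (2.24): number of ordered p-values before the first violation."""
    m = len(pvalues)
    ordered = sorted(pvalues)
    for i, p in enumerate(ordered, start=1):
        if p > i * alpha / m:
            return i - 1
    return m  # no violation at all: every hypothesis is rejected

# Hypothetical p-values for which the two procedures disagree.
pvals = [0.02, 0.021, 0.022, 0.023]
n_up = step_up_rejections(pvals, alpha=0.05)
n_down = step_down_rejections(pvals, alpha=0.05)
```

In this example the step-up procedure rejects all four hypotheses because the
largest p-value is below $\alpha$, whereas the step-down procedure stops at the
very first ordered p-value and rejects none. Step-down never rejects more than
step-up, which is the sense in which BH is the more liberal of the two.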

Optimality results for the step-up FDR controlling rule were proven in [8]
under the assumption of $\mu_i$ being normally distributed. A crucial step was
the definition of a random threshold for the BH rule

$c_{BH} = \min\{ c_{Bon}, \tilde c_{BH} \}$,

with

(2.25)   $\tilde c_{BH} = \inf\left\{ y : \frac{2 (1 - \Phi(y))}{1 - F_m(y)} \le \alpha \right\}$.
Fig 1. Comparison of the random thresholds $c_{BH}$ and $c_{SD}$ with the nonrandom threshold $c_{GW}$.
In the legend $F_1$ refers to $F_m$ and $F_2$ refers to $F$.



Here $1 - F_m(y) = \#\{|Z_i| \ge y\}/m$. Alternatively let us denote
$1 - \tilde F_m(y) = \#\{|Z_i| > y\}/m$. Similarly as in the case of BH it is
easy to check that SD rejects the null hypothesis $H_{0i}$ if and only if
$Z_i^2 > c_{SD}^2$, where

(2.26)   $c_{SD} = \sup\left\{ y : \frac{2 (1 - \Phi(y))}{1 - \tilde F_m(y) + 1/m} > \alpha \right\}$.



It was proven by Genovese and Wasserman (GW) in [25] that for fixed $p$, as
the number of tests increases, the random threshold $c_{BH}$ can be
approximated by the non-random threshold

(2.27)   $c_{GW}$ : $\frac{2 (1 - \Phi(c_{GW}))}{1 - F(c_{GW})} = \alpha$,

where $F(y) = P(|Z_i| \le y)$.

Figure 1 illustrates the thresholds $c_{BH}$, $c_{SD}$ and $c_{GW}$. Comparing
$c_{BH}$ and $c_{SD}$ with $c_{GW}$ the major change is in replacing the
cumulative distribution function of $|Z_i|$ by the corresponding empirical
distribution function. In [8] it was shown that also in case of sparsity
$c_{BH}$ can be well approximated by $c_{GW}$, and in Lemma 6.5 of Appendix 6.7
we will see that the same is true for $c_{SD}$. A much simpler result is that
under sparsity the difference between $c_{GW}$ and the corresponding BFDR
controlling threshold $c_B$ becomes asymptotically negligible.

Theorem 2.4. Suppose Assumptions (A) and (C) are true and that $p \to 0$.
Consider the rule rejecting the null hypothesis $H_{0i}$ if

$\frac{n \bar X_i^2}{\sigma^2} \ge c_{GW}^2$.

This rule is ABOS if and only if the corresponding BFDR controlling rule
defined in (2.18) (for the same $\alpha = \alpha_n$) is ABOS. In this case we
have

$c_{GW}^2 = c_B^2 + o_n$.

Proof. The proof of this statement follows exactly as the proof of Theorem
4.2 of [8]. □
The next theorem provides the optimality result of BH and SD for generally
distributed effect sizes under the alternative.

Theorem 2.5. Apart from Assumptions (A) and (C) assume that

(2.28)   $mp \to s \in (0, \infty]$

and

(2.29)   $\alpha$ satisfies the conditions of Theorem 2.3,

i.e. the BFDR control rule at level $\alpha$ is asymptotically optimal.

For the denser case

(2.30)   $p > \frac{\log^{\gamma_1} m}{m}$, for some constant $\gamma_1 > 1$,

the additional assumptions

(2.31)   $n \le m^{\gamma_2}$, for some $\gamma_2 > 0$, and $\frac{\log\log m}{\log(p\,\alpha)} \to 0$

should hold. Then both BH and SD are ABOS.

Proof. BH is more liberal than SD, thus it is enough to control the risk
contribution of type I error for BH, as well as the risk contribution of type
II error for SD. Under the first condition in (2.31) the proof for type I error
of BH follows along the same lines as the proof of Lemma 5.4 in [8]. Also,
under the condition of extreme sparsity (2.14) according to Lemma 2.3 the
Bonferroni procedure is ABOS. Therefore the optimality of the type II error
component of the risk of SD in the extremely sparse case follows directly from
a comparison with the more conservative Bonferroni correction. Finally, the
necessary bound of the type II error component of the risk of SD for the denser
case (2.30) is provided in Appendix 6.7. This proof substantially relies on the
second condition in (2.31). □



Remark 2.6. The upper bound on $m$ provided in the second condition of
(2.31) is not very restrictive. Specifically, it is satisfied whenever
$p \propto m^{-\beta}$ with $\beta \in (0, 1]$. For $p$ decreasing to 0 at a
slower rate (for example like $(\log m)^{-1}$) one can replace this bound with
the condition

(2.32)   $n \ge m^{\gamma_3}$ for some $\gamma_3 > 0$.

(It is easy to show that (2.32) implies the upper bound on $m$ in (2.31) given
the other assumptions of Theorem 2.5.)

The following Corollaries are easy consequences of Theorem 2.5.

Corollary 2.4. Suppose Assumptions (A) and (C) hold. If $p = m^{-\beta}$ with
$\beta \in (0, 1]$, $n \le m^{\gamma_2}$ for some $\gamma_2 > 0$ and $\delta$ is
bounded from above such that $\log \delta = o(\log m)$ then BH and SD at FDR
level $\alpha \propto n^{-1/2}$ are ABOS.

Corollary 2.5. Suppose Assumptions (A) and (C) hold. If $p = m^{-\beta}$ with
$\beta \in (0, 1]$, $n \propto \log m$ and $\delta$ converges to zero such that
$\log \delta = o(\log m)$ then BH and SD at a fixed FDR level
$\alpha \in (0, 1)$ are ABOS.

Remark 2.7. Corollary 2.4 states that under some mild restrictions on $\delta$
BH and SD at the FDR level $\alpha \propto n^{-1/2}$ are ABOS. Corollary 2.5
says that in case when $n \propto \log m$ then under the additional requirement
that $\delta \to 0$, BH and SD at the fixed FDR level $\alpha \in (0, 1)$ are
also ABOS. This result substantially extends the results of [8] to the case
where the prior on $\mu_i$ is fixed and not necessarily normal, while the
sample size $n$ slowly increases to infinity. This additionally justifies the
use of the fixed FDR level for BH in many applications, e.g. in bioinformatics,
where $n$ is much smaller than $m$. As discussed in [8] the condition
$\delta \to 0$ is quite reasonable in this context, since the cost of missing a
true positive is usually large if $p$ is very small.

3. ABOS in the context of multiple regression. It is well known [8, 23] 
that there is a strong connection between model selection for multiple regression 
and multiple testing rules. Under the simplified assumption of an orthogonal 
design matrix and known variance of the error term the two problems actually 
become identical. Consider a multiple linear regression model 

$Y_{n \times 1} = X_{n \times (m+1)} \beta_{(m+1) \times 1} + \epsilon_{n \times 1}$,

where the first column of the design matrix consists of ones and $\epsilon \sim N(0, \sigma^2 I_{n \times n})$.
Let us additionally assume that

(3.33) $X'X = n I_{(m+1) \times (m+1)}$,

and that the regression coefficients $\beta_1, \ldots, \beta_m$ can be modelled as independent
random variables from the following mixture distribution

(3.34) $(1-p)\delta_0 + p\nu$.

Under the assumptions (3.33) and (3.34) the least squares estimates $\hat\beta_i$, $1 \leq i \leq m$,
are independent random variables from the mixture distribution

(3.35) $(1-p)\, N\left(0, \frac{\sigma^2}{n}\right) + p \left(\nu * N\left(0, \frac{\sigma^2}{n}\right)\right)$.

This is identical with (2.3) and thus the problem of detecting true regressors is
equivalent to the multiple testing problem. Therefore, when each false
positive (falsely detected regressor) induces the cost $\delta_0$ and each false negative
induces the cost $\delta_A$, the thresholds of the Bayes rule and the optimal Bayes risk are
obtained just like in Lemma 2.2 and in Theorem 2.1.

As mentioned in the introduction we will focus here on the case of Assumption 
(B), where the loss ratio has no particular influence on the asymptotic results. 

In this case — 2&i_ = 5SP(i_)_o^)_ "We also consider only sparsity parameters 

$p \to 0$ satisfying assumption (2.28). Since under orthogonal designs m < n,
one has $-\log p = O(\log m) = O(\log n)$, and hence the quantity whose limit C is
prescribed by Assumption (A) tends to 0. Thus,
under orthogonal designs Assumptions (B) and (2.28) imply Assumption (A)
with C = 0. Therefore we will refrain from referring to Assumption (A) in this
section.

We will first discuss a model selection criterion which is ABOS in case of 
extreme sparsity (2.14), as in Corollary 2.1. However, it is easy to see that for 
m < n that sparsity assumption reduces to 

(3.36) $mp \to s \in (0, \infty]$, $\frac{\log(mp)}{\log n} \to 0$.

3.1. ABOS of mBIC when $\sigma$ is known. It was shown in [5] in the context
of QTL mapping that for large m classical model selection criteria like AIC or
BIC tend to select too large models. Based on Bayesian ideas a modified version
of BIC (mBIC) was proposed to take into account the number of available
regressors. When $\sigma$ is known the mBIC criterion suggests choosing the model
M for which

(3.37) $\frac{RSS_M}{\sigma^2} + k(\log n + 2 \log m + d)$

obtains a minimum, where $RSS_M$ refers to the residual sum of squares for
model M, k = k(M) is the number of regressors in the model and d is a certain
constant. A comprehensive introduction to the ideas leading to mBIC is given
in [9].

Remark 3.1. It follows from the derivation of mBIC that from a Bayesian
perspective $\exp(-d/2)$ is the a priori expected number of regressors. If there
is no prior knowledge on the model size the recommended standard choice is
$d = -2\log(4)$, which guarantees control of FWER at level 0.1 for n > 200 and
m > 10. For further details see [9].

Apart from ABOS we want to show consistency of mBIC. 

Definition. A model selection rule is said to be consistent if the probability of
selecting the true model converges to 1 as $m \to \infty$.

Theorem 3.1. Consider the orthogonal regression model specified by the
conditions (3.33) and (3.34) and let Assumptions (B) and (C) (with C=0) hold.
Under (3.36) mBIC is ABOS, while under the considerably weaker assumption

(3.38) $mp \to s \in (0, \infty]$, $mp \sqrt{\frac{\log n}{n}} \to 0$,

mBIC is consistent.

Proof. It is easy to check that under assumption (3.33) mBIC suggests
choosing those regressors for which

$\frac{n \hat\beta_i^2}{\sigma^2} > \log n + 2 \log m + d$.

From Corollary 2.1 one immediately concludes that under the sparsity assumption
(3.36) this selection rule is ABOS.
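The threshold form of the rule is easy to evaluate directly. The following Python sketch assumes an orthogonal design with known $\sigma$, so that under the null $n\hat\beta_i^2/\sigma^2$ is chi-square distributed with one degree of freedom. The function names `mbic_select` and `mbic_fwer_bound` and the example estimates are illustrative and not part of the original analysis. The second function computes a Bonferroni-type (union) bound on the FWER, which can be compared with the level 0.1 quoted in Remark 3.1 for $d = -2\log(4)$.

```python
import math


def mbic_select(beta_hat, n, m, sigma2=1.0, d=-2.0 * math.log(4.0)):
    """Indices j with n * beta_hat[j]**2 / sigma2 > log n + 2 log m + d."""
    threshold = math.log(n) + 2.0 * math.log(m) + d
    return [j for j, b in enumerate(beta_hat) if n * b * b / sigma2 > threshold]


def mbic_fwer_bound(n, m, d=-2.0 * math.log(4.0)):
    """Union bound m * P(chi2_1 > threshold) on the FWER of the rule above."""
    threshold = math.log(n) + 2.0 * math.log(m) + d
    # P(chi2_1 > c) = 2 * P(Z > sqrt(c)) = erfc(sqrt(c / 2)) for standard normal Z.
    per_test_error = math.erfc(math.sqrt(threshold / 2.0))
    return m * per_test_error


# Hypothetical estimates for m = 4 regressors and n = 256 observations.
selected = mbic_select([0.30, 0.01, -0.25, 0.05], n=256, m=4)
print(selected)
print(mbic_fwer_bound(200, 10))
```

Because the penalty is linear in k, minimizing (3.37) over all models and thresholding each coefficient separately give the same selection under (3.33), which is why a single pass over the estimates suffices here.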

To prove consistency of mBIC let the random variable $M_j$ be Bernoulli distributed,
where a misclassification of predictor $X_j$ denotes a success. If $t_1$ and
$t_2$ denote the probabilities of type I and type II errors of mBIC, respectively, then
for sufficiently large n

$P(M_j = 1) = (1-p) t_1 + p t_2 \leq K p \sqrt{\frac{\log n}{n}}$

for some constant K, where the last inequality is shown in Appendix 6.8. Using
Markov's inequality the probability of picking the wrong model (which is the
probability of at least one misclassification) can thus be bounded by

(3.39) $P\left(\sum_{j=1}^{m} M_j \geq 1\right) \leq E\left(\sum_{j=1}^{m} M_j\right) \leq K m p \sqrt{\frac{\log n}{n}}$,

which according to (3.38) converges to 0. □



Remark 3.2. Theorem 3.1 addresses the situation of sparsity, where the 
expected number of true signals remains constant or slowly increases with m. 
The assumption $mp \to s < \infty$ was used when deriving the mBIC penalty in
[5]. Theorem 3.1 actually tells us that mBIC remains optimal when the number
of true signals is mildly growing, for example $mp = \log m$ is still conceivable.
This scenario might be more realistic in many applications, where one would 
hope that by increasing the number of markers one could actually be able to 
detect more true signals. However, the situation described is still very sparse, 
which is one motivation to introduce in Section 3.2 criteria which are slightly 
less restrictive. 

Remark 3.3. Note that under the assumption $mp \to s < \infty$ the expected
number of false positives EP produced by the standard BIC is
$EP = m(1-p)t_1 = \sqrt{\frac{2}{\pi}} \frac{m}{\sqrt{n \log n}} (1 + o_{n,m})$. Thus BIC is not consistent
when $\lim_{n,m \to \infty} \frac{m}{\sqrt{n \log n}} > 0$.
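The order of magnitude in Remark 3.3 can be checked numerically. The sketch below assumes that, for known $\sigma$ and an orthogonal design, standard BIC includes a regressor whenever $n\hat\beta_i^2/\sigma^2 > \log n$, so that $t_1 = 2(1 - \Phi(\sqrt{\log n}))$. The normal tail approximation $1 - \Phi(c) \approx \phi(c)/c$ then gives $EP \approx \sqrt{2/\pi}\, m / \sqrt{n \log n}$. The function names are illustrative.

```python
import math


def bic_expected_false_positives(n, m, p=0.0):
    """Exact m (1 - p) t_1 for the threshold n * beta_hat**2 / sigma**2 > log n."""
    t1 = math.erfc(math.sqrt(math.log(n) / 2.0))  # equals 2 * P(Z > sqrt(log n))
    return m * (1.0 - p) * t1


def bic_false_positive_approx(n, m):
    """Leading term sqrt(2 / pi) * m / sqrt(n log n) from the tail approximation."""
    return math.sqrt(2.0 / math.pi) * m / math.sqrt(n * math.log(n))


exact = bic_expected_false_positives(n=10**6, m=1000)
approx = bic_false_positive_approx(n=10**6, m=1000)
print(exact, approx, exact / approx)
```

The ratio approaches 1 only at the slow rate $1/\log n$, so the approximation is best read as describing the asymptotic order of EP.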

Remark 3.4. Another interesting situation arises for distributions $\nu$ for
which there exists an open interval including 0 such that $\nu(-l, r) = 0$ (cf. [29]).
It can be shown that in this situation the mBIC rule is not optimal anymore,
although its risk still converges to 0.

3.2. Modifications of BIC controlling FDR. As shown in [9] there exists a
close connection between the mBIC penalty and the Bonferroni correction for multiple
testing. In a recent paper [1] Abramovich et al. extensively discuss
penalized model selection schemes which control the false discovery rate.
Their starting point is the close relationship between step-up and step-down
FDR controlling procedures at level $\alpha$ and the following penalizing scheme: For
models of size k define the selection criterion

(3.40) $\frac{RSS_M}{\sigma^2} + \sum_{i=1}^{k} q_N^2(\alpha i / 2m)$,

where $q_N(\eta)$ is the $(1 - \eta)$-quantile of the standard normal distribution. It
can be shown quite easily that the size of models selected by this procedure is
larger than or equal to $k_G$ and smaller than or equal to $k_F$ (see [1]). The procedure is therefore
nested between BH and SD, and from Theorem 2.5 it immediately follows that
it is also ABOS.
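For orientation, the step-up (BH) and step-down (SD) rules referred to here can be sketched as follows. The sketch uses the common textbook forms with critical values $\alpha k/m$ for the ordered p-values. It is an assumption that these coincide with definitions (2.23) and (2.24), and the function names are illustrative.

```python
def step_up_count(pvalues, alpha):
    """Benjamini-Hochberg: largest k with p_(k) <= alpha * k / m (0 if none)."""
    m = len(pvalues)
    k_up = 0
    for k, p in enumerate(sorted(pvalues), start=1):
        if p <= alpha * k / m:
            k_up = k
    return k_up


def step_down_count(pvalues, alpha):
    """Step-down: keep rejecting until the first k with p_(k) > alpha * k / m."""
    m = len(pvalues)
    k_down = 0
    for k, p in enumerate(sorted(pvalues), start=1):
        if p > alpha * k / m:
            break
        k_down = k
    return k_down


# Hypothetical p-values for which the two rules disagree.
pvals = [0.001, 0.03, 0.035, 0.5]
print(step_down_count(pvals, 0.05), step_up_count(pvals, 0.05))
```

The step-down count can never exceed the step-up count, which is the sense in which a procedure can be nested between SD and BH.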

We will adopt approximations of the FDR penalization (3.40) to amend BIC.
A simple argument involving the normal tail approximation shows that

$q_N^2(\alpha l / 2m) \approx 2 \log(m/l) - \log[2 \log(m/(\alpha l))] + \log(2/\pi) - 2 \log \alpha$.

In view of Corollary 2.2 we are mainly interested in the case where $\alpha \propto n^{-1/2}$,
which leads to the criterion

(3.41) mBIC1: $\frac{RSS_M}{\sigma^2} + k(\log(nm^2) + d_1) - 2 \log(k!) - \sum_{i=1}^{k} \log\log(nm^2/i^2)$.

Here the constant $d_1$ can be chosen appropriately to control FDR at a given
level. Neglecting the last term of the mBIC1 penalty, which is of a lower order
than the two preceding terms, leads to the following simplified form of (3.41),

(3.42) mBIC2: $\frac{RSS_M}{\sigma^2} + k(\log(nm^2) + d_2) - 2 \log(k!)$.

This might be thought of as a first order approximation of the FDR penalization,
whereas mBIC1 is a second order approximation. Interestingly, the penalty
in mBIC2 is very similar to a modification of RIC introduced in [26], with
additional penalty term

$2 \sum_{i=1}^{k} \log(m/i) = k \log(m^2) - 2 \log(k!)$,

which was motivated by an empirical Bayes approach.

Abramovich et al. consider in [1] the approximation $\sum_{i=1}^{k} q_N^2(\alpha i / 2m) \approx
k q_N^2(\alpha k / 2m)$, which can be justified by using the Stirling approximation for
$k!$. The resulting first order criterion has the form

(3.43) mBIC3: $\frac{RSS_M}{\sigma^2} + k(\log(nm^2) + d_3) - 2k \log(k)$.

Compared with (3.42) this means essentially that $\log(k!)$ is substituted by
$\log(k^k)$.

Remark 3.5. In the simulation study of Section 4 the constant of mBIC1
is chosen as $d_1 = 0$, which guarantees control of FDR at a level below 0.06
for sample size n larger than 200. For mBIC2 the constant $d_2 = -2\log(4)$ is
used, which coincides with the recommended standard choice of d for mBIC.
For moderate m and n as in the simulation study mBIC1 with $d_1 = 0$ and
mBIC2 with $d_2 = -2\log(4)$ have rather similar penalties for small k. In case of
mBIC3 the Stirling approximation leads to $d_3 = d_2 + 2$.
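The penalties of the four criteria can be compared numerically. The sketch below encodes the penalty terms of (3.37) and (3.41)-(3.43) as we read them, with the constants of Remark 3.5 as defaults. The function names are illustrative, and `math.lgamma(k + 1)` is used for $\log(k!)$.

```python
import math

D_STANDARD = -2.0 * math.log(4.0)


def penalty_mbic(k, n, m, d=D_STANDARD):
    """Penalty of (3.37): k (log n + 2 log m + d)."""
    return k * (math.log(n) + 2.0 * math.log(m) + d)


def penalty_mbic1(k, n, m, d1=0.0):
    """Penalty of (3.41): first order terms minus sum_i log log(n m^2 / i^2)."""
    second_order = sum(math.log(math.log(n * m * m / (i * i)))
                       for i in range(1, k + 1))
    return k * (math.log(n * m * m) + d1) - 2.0 * math.lgamma(k + 1) - second_order


def penalty_mbic2(k, n, m, d2=D_STANDARD):
    """Penalty of (3.42): k (log(n m^2) + d2) - 2 log(k!)."""
    return k * (math.log(n * m * m) + d2) - 2.0 * math.lgamma(k + 1)


def penalty_mbic3(k, n, m, d3=D_STANDARD + 2.0):
    """Penalty of (3.43): k (log(n m^2) + d3) - 2 k log k."""
    k_log_k = k * math.log(k) if k > 0 else 0.0
    return k * (math.log(n * m * m) + d3) - 2.0 * k_log_k


for k in (1, 5, 50):
    print(k, penalty_mbic(k, 256, 256), penalty_mbic1(k, 256, 256),
          penalty_mbic2(k, 256, 256), penalty_mbic3(k, 256, 256))
```

With $d_3 = d_2 + 2$ the difference between the mBIC3 and mBIC2 penalties is $2(\log k! - k\log k + k)$, which by Stirling's formula is close to $\log(2\pi k)$ and therefore grows only logarithmically in k.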

Theorem 3.2. Consider the orthogonal regression model specified by the
conditions (3.33) and (3.34). Let Assumptions (B) and (C) as well as (2.28)
be true. For the denser case (2.30) the additional condition (2.31) is assumed
to hold. Then the rules mBIC1, mBIC2 and mBIC3 are ABOS. The rules are
consistent under the additional assumption (3.38).

The proof is given in Appendix 6.9. 



Remark 3.6. The FDR controlling selection rules mBIC1 - mBIC3 are
ABOS under much weaker restrictions on the sparsity levels than mBIC. However,
the conditions for consistency are exactly the same. Actually, given the other
assumptions of Theorem 3.2, it follows that (3.38) is also necessary for the Bayes
rule to be consistent.

3.3. ABOS of mBIC when $\sigma$ is unknown. We have seen that for known
$\sigma$ and under the simplifying assumption of an orthogonal design matrix, the
problem of model selection using mBIC in multiple regression is equivalent to
multiple testing, in the sense that a regressor is included in the model chosen
by mBIC if and only if the corresponding square of the sample regression coefficient
is larger than a fixed threshold. In case of unknown $\sigma$ the situation gets
much more complicated and no such direct connection with multiple testing
can be established. We are only interested in the comparison of models which
include the intercept. In this case the Bayesian Information Criterion chooses
the model which minimizes $BIC = n \log RSS_M + k \log n$. The corresponding
version of mBIC becomes

(3.44) $mBIC = n \log RSS_M + k(\log n + 2 \log m + d)$.

Our main goal is to show that also in case of unknown $\sigma$ mBIC is asymptotically
optimal.
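As a small illustration of how criterion (3.44) trades off fit against model size, the following sketch evaluates it for two hypothetical nested models. The residual sums of squares and the function name are made up for the example.

```python
import math


def mbic_unknown_sigma(rss, k, n, m, d=-2.0 * math.log(4.0)):
    """Criterion (3.44): n log(RSS_M) + k (log n + 2 log m + d)."""
    if rss <= 0.0:
        raise ValueError("RSS must be positive; saturated models are excluded")
    return n * math.log(rss) + k * (math.log(n) + 2.0 * math.log(m) + d)


# Hypothetical residual sums of squares for two nested models.
small = mbic_unknown_sigma(rss=120.0, k=2, n=100, m=50)
large = mbic_unknown_sigma(rss=118.0, k=3, n=100, m=50)
print(small, large)
```

In this example the third regressor reduces $n \log RSS_M$ by less than the penalty per regressor, so the smaller model has the lower criterion value. The explicit check for non-positive RSS reflects that $\log RSS_M$ diverges to minus infinity for models that fit the data perfectly.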

A problem occurs when (3.44) is used as a selection criterion for very
large models. To be able to estimate the parameters of a model M we need the
restriction that $k \leq n - 2$. But if k is getting close to n then overfitting will lead
to extremely small $\log RSS_M$, and the global minimum of (3.44) is likely to be
attained by models of maximum size k = n - 2 (if that many regressors are
available). Under the assumption of sparsity it can be ruled out that such models
are correct. To cope with this pathology we will restrict L, the maximal number
of regressors to be allowed in addition to the common intercept term, by

(3.45) L = a { — -7T- ) as n — > cx) . 

\ (log n-\- 2 log m)^ log m ) 





On the other hand to bound the type II error it is necessary to search among 
sufficiently large models, and we require the lower bound 



(3.46) L > mp{logn) ^^' for some ry > and all sufficiently large 



n 



Theorem 3.3. Suppose as in Theorem 3.1 that Assumptions (B) and (C),
(3.33), (3.34) hold. Furthermore assume that (3.45) and (3.46) are true. Then
the mBIC criterion (3.44) is ABOS under (3.36), and consistent under (3.38).

The somewhat lengthy proof of this theorem is provided in Appendix 6.10. 

Remark 3.7. Note that except for the conditions (3.45) and (3.46) on the
potential model size L the assumptions for ABOS of mBIC in case of unknown
$\sigma$ are exactly the same as in Theorem 3.1 for known $\sigma$. We conjecture that
similarly the results of Theorem 3.2 concerning ABOS of the FDR controlling
modifications of BIC should also hold in case of unknown $\sigma$. However, the
techniques used for the proof of Theorem 3.3 cannot easily be extended to
mBIC1 - mBIC3. We will come back to this point in the simulation study in
the next section.

4. Simulation results. We employ computer simulations to investigate
the performance of the proposed model selection rules for multiple regression.
For the sake of simple notation in this section m denotes the number of regressors
plus intercept. We use orthogonal designs with n = m, where the design
matrices $X_{n \times m}$ are chosen as Hadamard matrices, whose elements are equal
to 1 or -1. For each of the simulation runs the number of nonzero regression
coefficients $k^*$ was simulated from a binomial distribution B(m, p). Then the
values of nonzero coefficients $\beta_1, \ldots, \beta_{k^*}$ were simulated from a normal distribution
$N(0, \tau^2)$, with $\tau^2 = 0.9$. Finally the values of the response variable were
simulated according to the multiple regression model

$Y_i = \sum_{j=1}^{k^*} \beta_j X_{ij} + \epsilon_i$,

where $\epsilon_i \sim N(0, 1)$. The specific value of the variance of regression coefficients
$\tau^2 = 0.9$ is selected in such a way that the power of the Bayes oracle for m = 256
is in the range between 50% and 60%. This choice allows us to assess differences
in performance of the considered model selection rules.
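The data-generating step can be sketched in a few lines of Python. This is a simplified illustration, not the code used for the study: it builds the Hadamard matrix by the Sylvester construction (so n must be a power of two), treats the all-ones first column as the intercept with coefficient 0, and lets every other coefficient be nonzero independently with probability p. All function names are illustrative.

```python
import random


def sylvester_hadamard(order):
    """Return a Hadamard matrix of the given order (a power of two)."""
    if order < 1 or order & (order - 1):
        raise ValueError("order must be a power of two")
    h = [[1]]
    while len(h) < order:
        h = [row + row for row in h] + [row + [-x for x in row] for row in h]
    return h


def simulate_response(design, p, tau2, rng, noise_sd=1.0):
    """Draw sparse coefficients and responses Y = X beta + eps.

    Assumption of this sketch: column 0 is the intercept (coefficient 0);
    each other coefficient is nonzero with probability p, then N(0, tau2).
    """
    m = len(design[0])
    beta = [0.0] * m
    for j in range(1, m):
        if rng.random() < p:
            beta[j] = rng.gauss(0.0, tau2 ** 0.5)
    y = [sum(x * b for x, b in zip(row, beta)) + rng.gauss(0.0, noise_sd)
         for row in design]
    return beta, y


def least_squares_orthogonal(design, y):
    """Least squares estimates X'y / n, valid because X'X = n I."""
    n, m = len(design), len(design[0])
    return [sum(design[i][j] * y[i] for i in range(n)) / n for j in range(m)]


rng = random.Random(1)
X = sylvester_hadamard(8)
beta, y = simulate_response(X, p=0.2, tau2=0.9, rng=rng)
beta_hat = least_squares_orthogonal(X, y)
print(beta_hat)
```

Because the columns are orthogonal with squared norm n, each estimate depends only on its own column, which is what makes the nested model search described later in this section possible.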

In the first part of the simulation study we consider sparsity parameters
$p \in \{0.001, 0.005, 0.01, 0.02, 0.05, 0.1, 0.2\}$ and simulate for m = 256 as well as
m = 1024. In the second part we will look at a wider range of sample sizes
$n = m \in \{128, 256, 512, 1024, 2048, 4096\}$, while the sparsity parameters are
computed according to $p \propto m^{-\beta}$ for four different levels $\beta \in \{1, 1/2, 1/4, 1/8\}$.

We compared the following model selection criteria:

1. The Bayes Oracle (2.4) with $\delta_0 = \delta_A$. This oracle is aimed at minimizing
the expected number of wrongly classified regressors and in our setting
includes those explanatory variables for which

(4.47) $n \hat\beta_i^2 > \frac{n\tau^2 + 1}{n\tau^2} \left( \log(n\tau^2 + 1) + 2 \log\frac{1-p}{p} \right)$.

2. Modified versions of Bayesian information criterion:

(a) mBIC: (3.37) with $d = -2\log 4$

(b) mBIC1: (3.41) with $d_1 = 0$

(c) mBIC2: (3.42) with $d_2 = -2\log 4$

(d) mBIC3: (3.43) with $d_3 = -2\log 4 + 2$

The values of the constants are chosen according to Remark 3.1 and Remark 3.5.

3. Step up and step down FDR controlling procedures, (2.23) and (2.24),
at FDR level $\alpha = 0.05$. These procedures test individually each of the
regression coefficients based on simple regression models.
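The oracle threshold in item 1 can be evaluated as follows. The sketch assumes the simulation setting of this section: unit error variance, a $N(0, \tau^2)$ distribution of the nonzero coefficients, and equal losses, so that a regressor is included when the posterior odds of "signal" against "null" exceed 1. The function names are illustrative.

```python
import math


def bayes_oracle_threshold(n, p, tau2):
    """Bound on n * beta_hat**2 above which the Bayes oracle includes a regressor."""
    c = n * tau2
    return (c + 1.0) / c * (math.log(c + 1.0) + 2.0 * math.log((1.0 - p) / p))


def posterior_odds(stat, n, p, tau2):
    """Posterior odds of signal versus null given stat = n * beta_hat**2.

    Under the null beta_hat ~ N(0, 1/n); under the alternative
    beta_hat ~ N(0, tau2 + 1/n).
    """
    c = n * tau2
    likelihood_ratio = math.exp(0.5 * stat * c / (c + 1.0)) / math.sqrt(c + 1.0)
    return p / (1.0 - p) * likelihood_ratio


t = bayes_oracle_threshold(n=256, p=0.01, tau2=0.9)
print(t, posterior_odds(t, n=256, p=0.01, tau2=0.9))
```

At the threshold the posterior odds equal 1, and the threshold grows as p decreases, reflecting that stronger evidence is required when true signals are rarer.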

Modified versions of BIC and FDR controlling procedures are investigated
under two scenarios: when $\sigma$ is known and when it is unknown. When
$\sigma$ is unknown, modified versions of BIC are based on $n \log RSS_M$ instead of
$RSS_M/\sigma^2$ (see (3.37) and (3.44)). For unknown $\sigma$ the FDR controlling procedures
are based on t-tests instead of z-tests.

To identify the regression models which are "best" with respect to our model
selection criteria, we start by ordering the explanatory variables based on the results
of simple regression t-tests. This procedure gives us the proper sequence
of nested models, since under the orthogonal design the estimate of a regression
coefficient for a given explanatory variable does not depend on the other regressors
included in the model. Then we compare values of model selection criteria
for these nested models, starting from the null model, with no explanatory variables,
and ending with a model of dimension $k_{\max} = 0.3m$. The need for
the bound on the maximal number of components in the considered models
results from the fact that under our design the residual sum of squares for the
full model is equal to 0. Therefore, in case of unknown $\sigma$, all modified versions
of BIC are optimized by the full model (see the discussion before introducing
assumption (3.45)). Despite this, according to Theorem 3.3 and the results
of [14] on the consistency of similar model selection rules, we expect that our
model selection criteria are consistent if the true design is sparse and $k_{\max}$ goes
to infinity at a slower rate than m. The choice $k_{\max} = 0.3m$ corresponds to the
expected upper bound of model sizes for the sparsity level p = 0.2.
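The nested search described above can be sketched for the known-$\sigma$ case as follows. The sketch relies on the identity $RSS_M = RSS_0 - n \sum_{j \in M} \hat\beta_j^2$, which holds under the orthogonality condition (3.33), and orders regressors by $|\hat\beta_j|$, which under an orthogonal design gives the same order as the simple regression t-tests. The function name, the argument `penalty` (a callable returning the penalty for model size k) and the example numbers are illustrative.

```python
import math


def select_nested(beta_hat, rss_null, n, sigma2, penalty, k_max):
    """Minimize RSS_M / sigma2 + penalty(k) over nested models of size 0..k_max.

    beta_hat: estimates for the candidate regressors (intercept excluded).
    rss_null: residual sum of squares of the intercept-only model.
    """
    order = sorted(range(len(beta_hat)), key=lambda j: -abs(beta_hat[j]))
    best_k = 0
    best_value = rss_null / sigma2 + penalty(0)
    rss = rss_null
    for k in range(1, min(k_max, len(order)) + 1):
        # Adding regressor j lowers the RSS by n * beta_hat[j]**2.
        rss -= n * beta_hat[order[k - 1]] ** 2
        value = rss / sigma2 + penalty(k)
        if value < best_value:
            best_k, best_value = k, value
    return sorted(order[:best_k])


# Hypothetical example with the mBIC penalty for n = 256 and m = 4.
def mbic_penalty(k):
    return k * (math.log(256) + 2.0 * math.log(4) - 2.0 * math.log(4))


chosen = select_nested([0.30, 0.01, -0.25, 0.05], rss_null=300.0, n=256,
                       sigma2=1.0, penalty=mbic_penalty, k_max=4)
print(chosen)
```

The loop scans all model sizes up to `k_max` and keeps the global minimum, which matters for penalties such as those of mBIC1 - mBIC3 that are not linear in k.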

For all considered procedures we report several characteristics, which are 
calculated based on 10000 replicates. For each of these replications we compute 
the number of chosen variables that do not appear in the true model (false 
positives, FP) and the number of true regressors which were not detected (false 
negatives, FN). These values are used to calculate the following statistics: 

1. Misclassification probability: MP = (FP + FN)/(m - 1).

2. False discovery rate: $FDR = \frac{FP}{FP + k^* - FN}$, or 0 in case of no discoveries.

3. $Power = \frac{k^* - FN}{k^*}$ (cases for which $k^* = 0$ are excluded from this analysis).

For each scenario the values of MP, FDR and Power are averaged over all 10000
simulations.
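For a single replicate these statistics can be computed as in the following sketch. It assumes that the false discovery proportion is the number of false positives divided by the number of selected regressors, and that power is the fraction of true regressors that were selected. The function name and the example sets are illustrative, and power is returned as `None` when there are no true regressors so that such replicates can be skipped when averaging.

```python
def classification_metrics(selected, true_support, m):
    """Return (MP, FDR, Power) for one replicate; m counts regressors plus intercept."""
    selected, true_support = set(selected), set(true_support)
    fp = len(selected - true_support)   # chosen variables not in the true model
    fn = len(true_support - selected)   # true regressors that were missed
    k_star = len(true_support)
    mp = (fp + fn) / (m - 1)
    fdr = fp / len(selected) if selected else 0.0
    power = (k_star - fn) / k_star if k_star > 0 else None
    return mp, fdr, power


mp, fdr, power = classification_metrics([1, 2, 5], [1, 2, 3], m=11)
print(mp, fdr, power)
```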

4.1. First part of Simulation. The results of this part of the simulation study 
are illustrated in Figure 3 and Figure 4 in Appendix 6.11. Figure 3 presents the 
graphs of our computed characteristics as functions of the sparsity parameter 
p in case of known $\sigma$. The two plots (a) and (b) of the first row show that,
as expected, the Bayes oracle has the lowest misclassification probability MP.
However, the differences in MP between the Bayes oracle and FDR controlling
procedures, as well as mBIC1-mBIC3, are hardly observable. For p < 0.05 also
MP of mBIC is comparable to and sometimes even better than MP of other 
criteria. However, for p = 0.05 differences become observable, and for p > 
0.05 MP of mBIC substantially exceeds the values reported for other methods. 
Qualitatively there is no different behavior in the plots for m = 256 and m = 
1024, though it is clear that MP gets smaller for larger sample size. These 
observations agree well with our results on the asymptotic optimality of mBIC 
in case of extreme sparsity, and of the FDR controlling procedures and mBIC1-mBIC3
in a wider range of sparsity levels. Apparently our asymptotic analysis
describes the situation already quite well for m = 256.

Plots (c) and (d) of Figure 3 show the FDR of different procedures. FDR
of the Bayes oracle increases from 0 for p = 0 to 0.08 for p = 0.2 in case of
m = 256, and to 0.03 for m = 1024. As expected, FDR of both step up and step
down multiple testing procedures slowly decreases from approximately 0.05 for
p = 0 to 0.04 for p = 0.2 independently of the sample size. The same pattern is
observed for the first modified version of BIC aimed at controlling FDR, mBIC1.
For m = 256 its FDR behaves almost identically to BH, whereas for m = 1024
FDR starts at 0.03 and decreases to 0.02. FDR of mBIC2 and mBIC3 behave
quite differently in case of extreme sparsity. Due to the choice of constants $d_1$
and $d_2$, FDR of mBIC2 is close to FDR of mBIC1 for small p. In contrast
mBIC3 has extremely small FDR for p close to 0, which is due to the fact that
for small k Stirling's approximation is not valid. For larger p (resulting in the
choice of larger models) mBIC2 and mBIC3 behave more and more similarly, and
their FDR stabilizes at a level of approximately 0.05 for m = 256 and at 0.025
for m = 1024, being thus slightly larger than FDR of mBIC1. Finally FDR of
the modified version of BIC aimed at controlling the Family Wise Error Rate,
mBIC, quickly decreases; for m = 256 from approximately 0.043 for p = 0 to
0.0015 for p = 0.2, and for m = 1024 from approximately 0.015 down to 0.001.

The pattern of the graphs (e) and (f) for power corresponds to the behavior
of FDR. At p = 0.001 clearly the Bayes oracle has smallest power. In case of
m = 256 for p > 0.01 the power of the Bayes oracle exceeds the power of other
model selection criteria, whereas for m = 1024 BH and SD have largest power.
However, the differences of power between all criteria apart from mBIC are very
small and for p > 0.001 do not exceed 4%. Also, it is interesting to observe that
the power of these criteria slowly increases with p. mBIC performs substantially
differently from the other methods. Its power is significantly smaller and remains
constant as a function of p. Graphs (e) and (f) illustrate also that as expected
power increases with sample size.

In Figure 4 the results for unknown $\sigma$ are reported. The most obvious difference
between the case of known and unknown $\sigma$ is observed for the multiple
testing procedures based on simple regression tests. FDR of these procedures
is close to the nominal level of 0.05 only when p is very close to 0. For larger
values of p other important regressors inflate the residual error in simple regression
tests, which leads to a very low power, low FDR and large misclassification
rate. As a consequence, when $\sigma$ is unknown simple regression tests perform substantially
worse than other methods based on model selection strategies. This
finding has been discussed extensively in [24] in the context of genome wide
association studies.

Concerning modified versions of BIC the performance of mBIC1-mBIC3 is
only slightly affected by the fact that $\sigma$ is unknown when p < 0.1. However, for
p = 0.2 and m = 256 one observes a significant increase of FDR and MP when
compared to the known $\sigma$ case. In particular mBIC2 and mBIC3 have a sudden
increase of FDR which results also in a significantly larger MP than that of the
Bayes rule. mBIC1 suffers from the same problem, though to a lesser extent.
Thus for larger p the second order approximation in mBIC1 proves beneficial.

While mBIC2 and mBIC3 become too liberal for larger p, mBIC has the
opposite tendency. Especially for m = 256 the fact that $\sigma$ is unknown leads to
a substantial decrease of power and FDR for large values of p. For m = 1024
the relative performance of mBIC substantially improves and is only slightly
worse than for known $\sigma$. However, both in terms of power and MP mBIC is
still performing much worse than mBIC1 - mBIC3.

4.2. Second part of Simulation. Here we want to assess numerically the
asymptotic behavior which was analyzed theoretically in Section 3. To this end
we will perform similar computations as above, but consider the wider range of
sample sizes $n = m \in \{128, 256, 512, 1024, 2048, 4096\}$. The sparsity parameter
is computed as $p = c_\beta m^{-\beta}$, where we analyze the extremely sparse case $\beta = 1$
as well as $\beta \in \{1/2, 1/4, 1/8\}$. For each scenario the factor $c_\beta$ is chosen such that
for m = 128 we always have p = 0.125. The misclassification probabilities for the
four different scenarios and for the various methods are provided in Figure 2.
We no longer consider SD, as it has been seen before to behave more or less
identically to BH. We also present here only the case of unknown $\sigma$, which is
of particular interest in view of the unproven conjecture that mBIC1 - mBIC3
will be ABOS for a wider range of sparsity levels than mBIC.
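The sparsity levels used in this part follow directly from the calibration at m = 128. A minimal sketch, with an illustrative function name, is:

```python
def sparsity(m, beta, m_ref=128, p_ref=0.125):
    """p = c_beta * m**(-beta), with c_beta fixed by p = p_ref at m = m_ref."""
    c_beta = p_ref * m_ref ** beta
    return c_beta * m ** (-beta)


for beta in (1.0, 0.5, 0.25, 0.125):
    print(beta, [sparsity(m, beta) for m in (128, 256, 512, 1024, 2048, 4096)])
```

For $\beta = 1$ the expected number of true signals mp stays constant at 16, whereas for smaller $\beta$ it grows with m.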

For m = 128 (and p = 0.125) mBIC1 has a lower misclassification rate than all
other criteria. mBIC2 and mBIC3 have a relatively large misclassification rate,
and are performing worse than mBIC. We had seen this behavior before already
for m = 256 and p = 0.2. If there are relatively many true signals and m is
small then mBIC2 and mBIC3 tend to be slightly too liberal.

For $\beta = 1$ the misclassification rate of all procedures converges towards that
of the optimal Bayes rule. In particular it is confirmed that mBIC is ABOS in
case of extreme sparsity, although mBIC1 - mBIC3 perform even better. In case
of extreme sparsity it seems that even BH behaves relatively well. For smaller $\beta$
a multiple testing approach is not suitable in case of unknown $\sigma$, as we discussed
already above.

Fig 2. Asymptotic behavior of the misclassification rate MP at sparsity $p \propto m^{-\beta}$ for different
values of $\beta$.

The smaller $\beta$, the poorer becomes the performance of mBIC. Although it
seems that its misclassification rate still converges towards that of the Bayes
rule, this is only true in absolute terms. Already for $\beta = 1/2$ the ratio of
the misclassification rates between BH and the Bayes rule remains more or
less constant at 1.2. For $\beta = 1/8$ this ratio is actually growing, and mBIC
is certainly not optimal. On the other hand MP of mBIC1 - mBIC3 rapidly
converges towards MP of the Bayes rule in all four scenarios, which supports
our conjecture that an analogue of Theorem 3.2 should also hold in case of
unknown $\sigma$. Finally Figure 2 suggests that regardless of the sparsity level $\beta$ all
modifications of BIC are consistent selection rules in the asymptotic framework
of Assumption B.

5. Discussion. The first part of this paper generalizes optimality results
of [8] for multiple testing procedures. Instead of scaled normal distributions we
consider a larger class of distributions under the alternative. Only
point null hypotheses are considered and the measure under the alternative is
kept fixed. The asymptotics is thus not driven by a scaling parameter which
determines the effect size, but rather by the sample size n which is assumed to
become large. In that context we study two situations. The first is the "verge of
detectability" case as in [8], where the power of the Bayes oracle is positive but less than
1. In this article the notion of "the verge of detectability" is extended to the
practically important case where the distribution of the effect size is fixed and
the sample size n slowly increases with the number of tests m. When sparsity is
of the form $p \propto m^{-\beta}$ and the ratio of losses $\delta$ is bounded from above, then the
"verge of detectability" is obtained when n grows proportionally to log m. The
second analyzed case is concerned with asymptotic power equal to 1, which is
naturally associated with the situation where n grows faster than log m.

In both cases all optimality results of [8] could be proved for the considered
general class of distributions, where in the second case the analysis is slightly
more involved and some additional mild restrictions on the asymptotic behavior
of the loss ratio $\delta$ are necessary. In particular it was shown that the Bonferroni
selection rule is ABOS in case of extreme sparsity, whereas the Benjamini-Hochberg
rule is ABOS under a much wider range of sparsity levels. Thus
results of [8] have been extended to many practically important cases, where
the distribution of the true effects is not symmetric. A new result is that the
step down version of the FDR-controlling procedure is ABOS under almost
the same conditions as BH.

Optimality results were then transferred into the context of linear regression.
The simplest situation is concerned with orthogonal regressors and known error
variance $\sigma^2$, where optimality results from multiple testing can be directly
applied. We analyzed the performance of mBIC, a modification of BIC which
was previously introduced for model selection in high dimensional data [5], and
which is known to control the family wise error rate under the given conditions
[9]. It turns out that mBIC is ABOS in case of extreme sparsity, namely under
the same conditions as the Bonferroni selection rule for multiple testing. Additionally
three different FDR-controlling modifications of BIC were introduced.
Optimality results for these selection rules, mBIC1 - mBIC3, entirely correspond
to results for the step up and step down FDR controlling procedures in
multiple testing. Thus mBIC1 - mBIC3 are ABOS under a much wider range
of sparsity levels than mBIC. All modified versions of BIC (including mBIC)
are consistent under the same assumption on sparsity levels which guarantees
consistency of the Bayes oracle.

Next we showed ABOS of mBIC under extreme sparsity in case of unknown
$\sigma$, a situation which is technically much more demanding than the previous
case of known $\sigma$. We conjecture that in analogy to the known $\sigma$ case, mBIC1
- mBIC3 should be ABOS when removing the extreme sparsity restriction.
While we were not able to give a formal proof, simulation results strongly
support this conjecture. Furthermore mBIC in case of unknown $\sigma$ is consistent
under the same conditions on sparsity levels under which the Bayes oracle is
consistent. The same is expected to hold for mBIC1 - mBIC3. Apart from our
simulation study, consistency of the modified versions of BIC for unknown $\sigma$
can also be conjectured based on recent consistency results for the extended
version of Bayesian Information Criterion, EBIC, reported in [13] and [14]. As
discussed in [40], if the dimension of the maximal allowable model $k_{\max}$ satisfies
$k_{\max}/m \to 0$ then mBIC2 is asymptotically equivalent to the standard version
of EBIC, based on a uniform prior on the model dimension. It follows that
mBIC2 can be interpreted as an approximation of the Bayesian rule, in which
the prior on the true number of regressors is uniform over the set $\{0, \ldots, k_{\max}\}$,
with $k_{\max} = o(m)$.

The results presented in this article are important to understand optimality 
of model selection criteria under sparsity. However, they are somewhat prelim- 
inary as they are only considering the case of orthogonal regressors. In most 
applications where sparsity is an issue one is also dealing with m > n, that is 
the number of regressors exceeds the sample size. Our current analysis is explic- 
itly not applicable to this situation. However, we believe that the majority of 
results can be extended to the case m > n if the design matrix satisfies certain 
conditions for identifiability of small models, which are discussed for example 
in [11], [4] or [14]. These expectations are confirmed by the successful applica- 
tion of mBIC2 to analyze genome wide association study data, as reported in 
[24]. Theoretical analysis of asymptotic optimality properties of modifications
of BIC under non-orthogonal designs is the topic of further research.

6. Appendix.

6.1. Proof of Lemma 2.1. The proof of Lemma 2.1 relies on the following
technical result.

Lemma 6.1. Let $a_n \to a$ be any convergent sequence. Define
$h_n(\mu) := \exp\left(\frac{a_n \mu}{\sigma^2} - \frac{\mu^2}{2\sigma^2}\right)$ and $h(\mu) := \exp\left(\frac{a \mu}{\sigma^2} - \frac{\mu^2}{2\sigma^2}\right)$. Then

(6.48) $\lim_{n \to \infty} \|h_n\|_{L^n(\nu)} = \|h\|_{L^\infty(\nu)}$.

Proof. First note that for all n it holds that $h_n \in L^\infty(\nu)$, and therefore also
$h_n \in L^m(\nu)$ for all m > 0. It is easy to check that $\lim_n \|h_n - h\|_{L^\infty(\nu)} = 0$. Thus for
any $\epsilon > 0$ and sufficiently large n we have $\|h_n - h\|_{L^n(\nu)} \leq \|h_n - h\|_{L^\infty(\nu)} < \epsilon$.
Now (6.48) easily follows by the triangle inequality and the fact that
$\lim_{n \to \infty} \|h\|_{L^n(\nu)} = \|h\|_{L^\infty(\nu)}$. □
Now we are ready to prove Lemma 2.1.

Proof. Let $h_n(\mu) = \exp\left(\frac{a_n \mu}{\sigma^2} - \frac{\mu^2}{2\sigma^2}\right)$. Then $(\delta f)^{1/n} = \|h_n\|_{L^n(\nu)}$ and due to
Assumption (A) $\lim_n (\delta f)^{1/n} = e^{C/2}$. Note that $a_n$ has to be bounded, otherwise
the sequence $\|h_n\|_{L^n(\nu)}$ could not be bounded. Let a be an accumulation point
of $a_n$. By Lemma 6.1 for any subsequence $a_j \to a$ it holds that

(6.49) $\lim_{j} \|h_j\|_{L^j(\nu)} = \|h\|_{L^\infty(\nu)} = \exp(C/2)$.

If $a \in S$ then $\|h\|_{L^\infty(\nu)} = \exp\left(\frac{a^2}{2\sigma^2}\right)$ and taking logarithms yields $a = -\sqrt{C}\sigma$.
Thus the only potential accumulation point of $a_n$ within S is $-T$. To complete
the proof of Lemma 2.1 we will show that $a \notin S$ leads to a contradiction with
Assumption (C).

If $a \notin S$ then $a \in (l_a, r_a)$ where $l_a < r_a$ are the boundaries of $S$ closest to $a$. 
It is immediately clear that either 

(6.50)    $\|h\|_{L^\infty(\nu)} = h(l_a)$ 
or 

(6.51)    $\|h\|_{L^\infty(\nu)} = h(r_a)$ . 

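Consider first the case (6.51) with $r_a \neq 0$. Then (6.49) reads $T^2 = 2ar_a - r_a^2$, and this identity together with $a < r_a$, $a \le 0$ and $h(r_a) \ge h(l_a)$ (equivalently $2a \ge l_a + r_a$) imply 

$r_a < 0$ ,  $T^2 > r_a^2$  and  $T^2 \le l_a r_a$ , 

and we conclude that $-T \in (l_a, r_a)$. But according to Assumption (C) we have 
$-T \in S$, which contradicts $(l_a, r_a) \cap S = \emptyset$. 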

Similarly, one can show that for any value $a \in (l_a, \frac{l_a + r_a}{2})$ the case (6.50) also 
leads to a contradiction with Assumption (C). 

Now consider the remaining case (6.51) and $r_a = 0$. Then (6.49) implies 
that $T = 0$. However, due to Assumption (C) $\nu$ has a positive density in some 
neighborhood of 0, in contradiction with $r_a$ lying on the boundary of the support 
of $\nu$. 

The proof that $b_n \to T$ goes exactly along the same lines. □ 



6.2. Proof of Lemma 2.2. 

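Proof. By Lemma 2.1 $a_n$ converges to 0. Also, by Assumption (C) there 
exists $\epsilon > 0$ such that $\nu$ has a density $\rho(\mu)$ on the interval $(-\epsilon, \epsilon)$. It is 
immediately clear that 

(6.52)    $\int_{(\epsilon,\infty)} h_n^n(\mu)\,d\nu(\mu) \le e^{-n\epsilon^2/(2\sigma^2)}\,\nu(\epsilon,\infty)$ . 

Also, there exists $n_0$ such that for every $\mu < -\epsilon$ and $n > n_0$ it holds $a_n\mu < \mu^2/4$ 
(because $a_n \to 0$). Thus for $n > n_0$ 

(6.53)    $\int_{(-\infty,-\epsilon)} h_n^n(\mu)\,d\nu(\mu) \le e^{-n\epsilon^2/(4\sigma^2)}\,\nu(-\infty,-\epsilon)$ . 

Concerning the integral over the interval $(-\epsilon, \epsilon)$, by completion of squares 
one derives 

$\int_{-\epsilon}^{\epsilon} h_n^n(\mu)\rho(\mu)\,d\mu = \rho_n \exp\left(\frac{n a_n^2}{2\sigma^2}\right) \int_{-\epsilon}^{\epsilon} \exp\left(-\frac{n(\mu - a_n)^2}{2\sigma^2}\right) d\mu$ 

(6.54)    $= \rho_n\, e^{n a_n^2/(2\sigma^2)}\, \frac{\sqrt{2\pi}\,\sigma}{\sqrt{n}} \left[\Phi\left(\sqrt{n}(\epsilon - a_n)/\sigma\right) - \Phi\left(\sqrt{n}(-\epsilon - a_n)/\sigma\right)\right]$ , 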
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where p„ G [ inf pifJ-), sup p(^)], and < inf p{fi) < sup p(/u) < oo 

according to Assumption (C). 

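Note that $\Phi(\sqrt{n}(\epsilon - a_n)/\sigma) \to 1$ as well as $\Phi(\sqrt{n}(-\epsilon - a_n)/\sigma) \to 0$ (because 
$a_n \to 0$). Comparing (6.52), (6.53) and (6.54) we observe that the integral over 
$(-\epsilon, \epsilon)$ dominates the two remaining terms and from (2.4) it follows that 

$1 = \sqrt{2\pi}\,\sigma\rho_n\,(f\delta)^{-1}\, n^{-1/2} \exp\left(\frac{n a_n^2}{2\sigma^2}\right)(1 + o_n)$ . 

Thus we may conclude that the sequence 

$s_n := (f\delta)^{-1}\, n^{-1/2} \exp\left(\frac{n a_n^2}{2\sigma^2}\right)$ 

is bounded and therefore for any convergent subsequence it holds that 

$a_n \sim -\sigma\,\frac{\sqrt{\log n + 2\log(\delta f)}}{\sqrt{n}}$ . 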



To get the exact behavior we further split the domain of the integral in $(-\epsilon, -g_n)$, 
$(-g_n, 0)$ and $(0, \epsilon)$, where $g_n$ is a positive sequence such that $a_n = o(g_n)$, or more 
specifically 

(6.55)    $g_n \to 0$  with  $\frac{\log n}{n g_n^2} \to 0$ ,  $\frac{\log(\delta f)}{n g_n^2} \to 0$ . 

For the first interval we get a bound by evaluating the integrand at $-g_n$, for 
the second and third interval we repeat the computations leading to (6.54) with 
the corresponding boundaries, and finally obtain 

$\delta f = \int h_n^n(\mu)\rho(\mu)\,d\mu\,(1 + o_n) = \frac{\sqrt{2\pi}\,\sigma\rho(0^-)}{\sqrt{n}} \exp\left(\frac{n a_n^2}{2\sigma^2}\right)(1 + o_n)$ 

which yields (2.5). The proof for $b_n$ is exactly the same. □ 
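The last display can be checked against a case where the oracle threshold is available in closed form. The sketch below assumes a hypothetical prior $\nu = N(0, \tau^2)$ (not a choice made in the paper), for which the likelihood ratio equation can be solved exactly, and compares the exact $a_n^2$ with the value obtained by solving $\delta f = \sqrt{2\pi}\sigma\rho(0^-) n^{-1/2}\exp(n a_n^2/(2\sigma^2))$ for $a_n^2$. The function names and parameter values are illustrative.

```python
import math

def exact_threshold_sq(n, f_delta, sigma=1.0, tau=1.0):
    """Exact a_n^2 when nu = N(0, tau^2) (illustrative assumption).

    Under H0 the statistic is N(0, sigma^2/n), under the alternative it is
    N(0, sigma^2/n + tau^2); the likelihood ratio is set equal to f*delta.
    """
    r = n * tau ** 2 / sigma ** 2  # prior variance relative to sampling variance
    return (sigma ** 2 / n) * (2.0 * math.log(f_delta) + math.log1p(r)) * (1.0 + r) / r

def asymptotic_threshold_sq(n, f_delta, sigma=1.0, tau=1.0):
    """a_n^2 from f*delta = sqrt(2*pi)*sigma*rho(0)/sqrt(n) * exp(n*a_n^2/(2*sigma^2))."""
    rho0 = 1.0 / (math.sqrt(2.0 * math.pi) * tau)  # density of N(0, tau^2) at 0
    const = math.log(math.sqrt(2.0 * math.pi) * sigma * rho0)
    return (2.0 * sigma ** 2 / n) * (math.log(f_delta) + 0.5 * math.log(n) - const)

def ratio(n, f_delta):
    return exact_threshold_sq(n, f_delta) / asymptotic_threshold_sq(n, f_delta)

if __name__ == "__main__":
    for n in (10 ** 2, 10 ** 4, 10 ** 6):
        print(n, ratio(n, f_delta=100.0))
```

Under these assumptions the ratio of the two expressions approaches one as $n$ grows, in line with the $(1 + o_n)$ factor.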



Remark 6.1. The proof of Lemma 2.2 relies upon choosing a suitable se- 
quence gn- The choice of the sequence gn strongly depends on the asymptotic 
behavior of 5f. If for example for sufficiently large n, 5f < n", with a > 0, one 
might use gn = -^^ , the choice of [29] . Another situation occurs if 5/ ~ e" 

with < 7 < 1, where gn = n^'^'^ is a suitable choice. 
6.3. Proof of Theorem 2.1. 

Proof. 

Notice that the type II error of the Bayes oracle is given by $t_2 = \int \Psi_n(\mu)\,d\nu(\mu)$ 
with 

$\Psi_n(\mu) = \Phi\left(\frac{\sqrt{n}(b_n - \mu)}{\sigma}\right) - \Phi\left(\frac{\sqrt{n}(a_n - \mu)}{\sigma}\right)$ . 
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We will now calculate the asymptotic formula for the type II error in case 
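when $C = 0$. Consider first the integral over $\mu \in (-\infty, 0)$. Remember that 
$a_n \to 0$, thus for $n$ sufficiently large $\nu$ has a density $\rho(\mu)$ on $(2a_n, 0)$ and it 
holds that 

$\int_{2a_n}^{0} \Psi_n\,d\nu = \int_{2a_n}^{0} \left[\Phi\left(\frac{\sqrt{n}(b_n - \mu)}{\sigma}\right) - \Phi\left(\frac{\sqrt{n}(a_n - \mu)}{\sigma}\right)\right] \rho(\mu)\,d\mu$ . 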



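Applying the mean value theorem and the substitution $z = \sqrt{n}(\mu - a_n)/\sigma$ yields 

$\int_{2a_n}^{0} \Psi_n\,d\nu = \frac{\sigma\rho_n}{\sqrt{n}} \int_{\sqrt{n}a_n/\sigma}^{-\sqrt{n}a_n/\sigma} \left[\Phi\left(\frac{\sqrt{n}(b_n - a_n)}{\sigma} - z\right) - \Phi(-z)\right] dz$ 

for some $\rho_n \in [\inf_{\mu\in(2a_n,0)} \rho(\mu), \sup_{\mu\in(2a_n,0)} \rho(\mu)]$. Using the facts that $\int_{-x}^{x} \Phi(z)\,dz = x$ and $\sqrt{n}(a_n - b_n)/\sigma \to -\infty$ we further obtain 

$\int_{2a_n}^{0} \Psi_n\,d\nu = \frac{\sigma\rho_n}{\sqrt{n}} \left[\int_{\sqrt{n}a_n/\sigma}^{-\sqrt{n}a_n/\sigma} [1 - \Phi(-z)]\,dz - \int_{\sqrt{n}a_n/\sigma}^{-\sqrt{n}a_n/\sigma} \Phi\left(z + \frac{\sqrt{n}(a_n - b_n)}{\sigma}\right) dz\right]$ 

$= -\rho(0^-)\,a_n\,(1 + o_n) = \sigma\rho(0^-)\sqrt{\frac{\log v}{n}}\,(1 + o_n)$ , 

where the last equality holds due to (2.7). 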

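It remains to show that the integral over $(-\infty, 2a_n)$ is of lower order. It holds 
that 

$\int_{-\infty}^{2a_n} \Psi_n\,d\nu \le \int_{-\infty}^{2a_n} \left(1 - \Phi(\sqrt{n}(a_n - \mu)/\sigma)\right) d\nu \le 1 - \Phi(-a_n\sqrt{n}/\sigma) = O\left((v\log v)^{-1/2}\right)$ . 

Assumption (A) yields $f\delta\log v \to \infty$, and hence $(v\log v)^{-1/2} = o\left(\sqrt{\log v / n}\right)$. 
Similar computations for the interval $(0, \infty)$ lead to 

(6.56)    $t_2 = \sigma\sqrt{\frac{\log v}{n}}\,\left(\rho(0^-) + \rho(0^+)\right)(1 + o_n(1))$ . 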
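Formula (6.56) can be compared with an exact computation in a simple special case. The sketch below assumes a hypothetical prior $\nu = N(0, \tau^2)$ and symmetric thresholds $b_n = -a_n = \sigma\sqrt{\log v / n}$ with $v = n(f\delta)^2$ (the leading-order form only, without lower-order corrections); both are illustrative assumptions, and the function names are not from the paper.

```python
import math

def log_v(n, f_delta):
    """log v with v = n * (f*delta)^2."""
    return math.log(n) + 2.0 * math.log(f_delta)

def type2_exact(n, f_delta, sigma=1.0, tau=1.0):
    """Exact type II error P(|X| < b_n) when mu ~ N(0, tau^2), X | mu ~ N(mu, sigma^2/n)."""
    b_n = sigma * math.sqrt(log_v(n, f_delta) / n)
    marginal_sd = math.sqrt(tau ** 2 + sigma ** 2 / n)
    # 2*Phi(z) - 1 equals erf(z / sqrt(2)) for the standard normal cdf Phi
    return math.erf(b_n / (marginal_sd * math.sqrt(2.0)))

def type2_asymptotic(n, f_delta, sigma=1.0, tau=1.0):
    """Leading term sigma * sqrt(log v / n) * (rho(0-) + rho(0+)) from (6.56)."""
    rho0 = 1.0 / (math.sqrt(2.0 * math.pi) * tau)  # density is continuous at 0 here
    return sigma * math.sqrt(log_v(n, f_delta) / n) * 2.0 * rho0

if __name__ == "__main__":
    for n in (10 ** 2, 10 ** 4, 10 ** 6):
        print(n, type2_exact(n, 10.0), type2_asymptotic(n, 10.0))
```

For a prior with a jump of the density at zero the two one-sided limits $\rho(0^-)$ and $\rho(0^+)$ would enter separately, which is why (6.56) keeps them apart.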



In case of $0 < C < \infty$ we know from Lemma 2.1 that $a_n \to -T$ and $b_n \to T$, 
where $T = \sigma\sqrt{C} > 0$. For $\mu \in (-T, T)$, $\Psi_n(\mu) \to 1$, while for $\mu \in (-\infty, -T) \cup (T, \infty)$, $\Psi_n(\mu) \to 0$. Then by the dominated convergence theorem, 

(6.57)    $\int_{-\infty}^{\infty} \Psi_n(\mu)\,d\nu(\mu) = \nu(-T, T)\,(1 + o_n)$ , 
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and u{—T, T) > 0, since the distribution has a positive density in neighborhoods 
of — T and T. 

The Bayes risk can be written as 

$R = mp\,\delta_A t_2\,(1 + f\delta\,t_1/t_2)$ . 

Thus by (6.56) and (6.57) to complete the proof of Theorem 2.1 it is enough to 
show that 

(6.58)    $f\delta\,t_1/t_2 \to 0$ . 

In case of $C = 0$, (2.7) and the normal tail approximation yield $t_1 \propto (v\log v)^{-1/2}$. 
Thus from (6.56) we easily obtain 

$f\delta\,\frac{t_1}{t_2} \propto \frac{f\delta\sqrt{n}}{\sqrt{v\log v}\,\sqrt{\log v}} = \frac{1}{\log v} \to 0$ . 

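In case of $C > 0$ we write $t_1 = t_{1a} + t_{1b}$, where $t_{1a} = \Phi(\sqrt{n}a_n/\sigma)$ and 
$t_{1b} = 1 - \Phi(\sqrt{n}b_n/\sigma)$. Using the fundamental equality (2.4) for $a_n$, the product $\delta f\,t_{1a}$ can be expressed through the integral $\int \exp\left(-\frac{n}{2\sigma^2}(a_n - \mu)^2\right) d\nu(\mu)$. 
Because $a_n \to -T$, similar considerations as in (6.52) show that the integral 
vanishes rapidly for $\mu \notin (-T - \epsilon, -T + \epsilon)$. Now observe that 

$\frac{1}{\sqrt{2\pi}} \int_{-T-\epsilon}^{-T+\epsilon} \exp\left(-\frac{n}{2\sigma^2}(a_n - \mu)^2\right) \rho(\mu)\,d\mu \le M_\rho\,\frac{1}{\sqrt{2\pi}} \int_{-T-\epsilon}^{-T+\epsilon} \exp\left(-\frac{n}{2\sigma^2}(a_n - \mu)^2\right) d\mu$ 

where $M_\rho = \sup_{\mu\in(-T-\epsilon,\,-T+\epsilon)} \rho(\mu) < \infty$. Moreover, 

$\frac{1}{\sqrt{2\pi}} \int_{\mathbb{R}} \exp\left(-\frac{n}{2\sigma^2}(a_n - \mu)^2\right) d\mu = \frac{\sigma}{\sqrt{n}}$ . 

Thus we finally obtain $\delta f\,t_{1a} = O\left(\frac{1}{n}\right)$. Analogous considerations for $t_{1b}$ finish 
the proof. □ 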



6.4. Proof of Theorem 2.2. 

Proof. First consider the case $C = 0$. To prove sufficiency of (2.12) and 
(2.13) for ABOS of a fixed threshold rule note that computing type II error 
for rules of the form (2.11) involves similar computations to those leading to 
(6.56), but using $\tilde{a}_n$ and $\tilde{b}_n$ instead of $a_n$ and $b_n$. Taking into account (2.11) 
and (2.12) one thus obtains 

$\int_{2\tilde{a}_n}^{0} \Psi_n\,d\nu = -\rho(0^-)\,\tilde{a}_n\,(1 + o_n) = \rho(0^-)\,\sigma\sqrt{\frac{\log v}{n}}\,(1 + o_n)$ , 
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which is asymptoticahy equivalent to the first contribution of the type II error 
of the Bayes Oracle. On the other hand 

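$\int_{-\infty}^{2\tilde{a}_n} \Psi_n\,d\nu \le 1 - \Phi(-\tilde{a}_n\sqrt{n}/\sigma) \sim \frac{\exp(-z_a/2)}{\sqrt{2\pi v[\log(v) + z_a]}} = o\left(\sqrt{\frac{\log v}{n}}\right)$ , 

where the last equality follows from the first part of Assumption (A) and (2.13). 
Similar calculations on the interval $[0, \infty)$ yield 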







Thus the type II error component of the risk R2 = mp6jit2 satisfies R2 = 
i?^(l + o„). 

Now, using (6.56) and the tail approximation for the type I error we obtain 

/«.QX R /rB m{l-p)6oh ^ exp{-Za/2) + exp(-Zfc/2) ^^ , ^ ^ 

(6.59) Ri R = -75 = Co-p (1 + On) , 

R^ log V 

where C^p = ^— J_ — ttht^- Thus under assumption (2.13) Ri = o{R^), 
which completes the proof of sufficiency for C = 0. 

In case of C > due to (2.12) it holds that a„ — )■ —T and 6„ — )■ T, and 
hence thresholds specified by (2.11) also have type II error of the form (6.57). 
For sufficiency it remains to establish (6.58). To this end note that the type I 
error can be written approximately as 

1 exp(-2:a/2) + exp(-Zb/2) 
^/2^T \/v log V 

Hence 

(6.60) R^/R^ = g^ exp(-zj2)+exp(-V2) ^^ ^ ^^^ 

logw 

where C,y = -jjrz^j^. Thus, under assumption (2.13) again i?i = o{R^), and 
the proof of sufficiency is completed. 

Concerning necessity, similar arguments as in the proof of Theorem 3.2 of [8] 
show that (2.12) is necessary for ABOS. In that case the computations leading 
to (6.59) and (6.60) are still valid and imply the necessity of (2.13). 

D 



6.5. Lemma on the existence of the exact BFDR controlling rule. 

We first prove the following result 

Lemma 6.2. For any fixed $s \neq 0$ the function 

$f(c) = \frac{2 - \Phi(c - s) - \Phi(c + s)}{2(1 - \Phi(c))}$ 

satisfies 
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a) /(O) = 1, 

b) $\lim_{c\to\infty} f(c) = \infty$, 

c) f{c) is increasing in c for c > 0. 

Proof. 

Points a) and b) easily follow by elementary algebra. To prove point c) let 
us define 

$g(c) := (1 - \Phi(c))\,\phi(c - s) - (1 - \Phi(c - s))\,\phi(c)$ . 

Then straightforward calculations yield 

$f'(c) > 0 \iff g(c + s) > g(c)$ . 

Let us consider at first the case of $s > 0$. In this situation it is enough to show 
that $g(c)$ is increasing. We find 

$g'(c) = (1 - \Phi(c))\,\phi'(c - s) - (1 - \Phi(c - s))\,\phi'(c)$ 

and define $h(c) = \frac{\phi'(c)}{1 - \Phi(c)}$. Then clearly 

$g'(c) > 0 \iff h(c - s) > h(c)$ . 

To show that $h(c)$ is a decreasing function observe that 

$h'(c) = \frac{1}{2\pi(1 - \Phi(c))^2}\,e^{-c^2/2} \left(\sqrt{2\pi}\,(c^2 - 1)(1 - \Phi(c)) - c\,e^{-c^2/2}\right)$ . 

Now, the standard bound on the tail of the normal distribution yields 

$\sqrt{2\pi}\,c^2\,(1 - \Phi(c)) < c\,e^{-c^2/2}$ , 

which implies that $h'(c) < 0$. 

The proof for $s < 0$ goes analogously. In that case $g(c)$ has to be decreasing, 
which yields $h(c) > h(c - s)$, and again $h(c)$ is a decreasing function. □ 
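The three properties of $f$ stated in Lemma 6.2 are easy to probe numerically. The following is a minimal sketch using only the standard library; the grid and the values of $s$ are arbitrary illustrative choices, and a finite grid check is of course no substitute for the proof.

```python
import math

def sf(x):
    """Standard normal survival function 1 - Phi(x)."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def f(c, s):
    """f(c) = (2 - Phi(c-s) - Phi(c+s)) / (2 * (1 - Phi(c))) from Lemma 6.2."""
    return (sf(c - s) + sf(c + s)) / (2.0 * sf(c))

def is_increasing_on_grid(s, c_max=6.0, step=0.05):
    """Check f(., s) is strictly increasing on a grid of points in [0, c_max]."""
    steps = int(round(c_max / step))
    values = [f(k * step, s) for k in range(steps + 1)]
    return all(a < b for a, b in zip(values, values[1:]))

if __name__ == "__main__":
    for s in (0.5, 1.0, 2.0, -1.5):
        print(s, f(0.0, s), f(6.0, s), is_increasing_on_grid(s))
```

Writing the numerator as a sum of two survival functions avoids the cancellation that would occur when subtracting values of $\Phi$ close to one.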

The following Lemma 6.3 easily follows from Lemma 6.2. 

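Lemma 6.3. Let $\nu(\cdot)$ be any probability measure such that $\nu(0) < 1$. Let us 
define 

(6.61)    $H(c) := \frac{\int \left(2 - \Phi(c - \sqrt{n}\mu/\sigma) - \Phi(c + \sqrt{n}\mu/\sigma)\right) d\nu(\mu)}{2(1 - \Phi(c))}$ . 

Then it holds 

a) $H(0) = 1$, 

b) $\lim_{c\to\infty} H(c) = \infty$, 

c) $H(c)$ is increasing on $[0, \infty)$. 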
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Lemma 6.4. Let z^(-) be any probability measure such that z^(0) < 1. Let 

$BFDR(c) = \frac{(1 - p)\,t_1(c)}{(1 - p)\,t_1(c) + p\,(1 - t_2(c))}$ , 

where 

$t_1(c) = 2(1 - \Phi(c))$ 

and 

$t_2(c) = 1 - \int \left(\Phi(-c - \sqrt{n}\mu/\sigma) + 1 - \Phi(c - \sqrt{n}\mu/\sigma)\right) d\nu$ . 

Then $BFDR(c)$ is continuously decreasing from $1 - p$ for $c = 0$ to 0 for $c \to \infty$. 
Proof. Observe that 

$BFDR(c) = \frac{1}{1 + \frac{p}{1 - p}\,H(c)}$ , 

with $H(c)$ as in (6.61). Thus Lemma 6.4 is a direct consequence of Lemma 6.3. □ 
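Because $BFDR(c)$ decreases continuously from $1 - p$ to 0, a threshold with BFDR exactly equal to a given level $\alpha < 1 - p$ exists and can be located by bisection. The sketch below assumes a hypothetical discrete distribution of standardized effects $s = \sqrt{n}\mu/\sigma$ (a list of value and weight pairs); the function names, the search bound and the example numbers are illustrative and not part of the paper.

```python
import math

def sf(x):
    """Standard normal survival function 1 - Phi(x)."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def bfdr(c, p, effects):
    """BFDR(c) for a two-sided rule |Z| > c.

    effects: list of (s, weight) pairs, s = sqrt(n)*mu/sigma (hypothetical
    discrete stand-in for the measure nu), weights summing to one.
    """
    t1 = 2.0 * sf(c)                                              # type I error
    power = sum(w * (sf(c - s) + sf(c + s)) for s, w in effects)  # 1 - t2
    return (1.0 - p) * t1 / ((1.0 - p) * t1 + p * power)

def bfdr_threshold(alpha, p, effects, upper=30.0, iterations=200):
    """Bisection for the c in [0, upper] with BFDR(c) = alpha."""
    if not 0.0 < alpha < 1.0 - p:
        raise ValueError("alpha must lie strictly between 0 and 1 - p")
    lo, hi = 0.0, upper
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if bfdr(mid, p, effects) > alpha:
            lo = mid  # BFDR still too large: move the threshold up
        else:
            hi = mid
    return 0.5 * (lo + hi)

EXAMPLE_EFFECTS = [(3.0, 0.5), (-2.0, 0.5)]  # illustrative values only

if __name__ == "__main__":
    c_star = bfdr_threshold(0.05, 0.1, EXAMPLE_EFFECTS)
    print(c_star, bfdr(c_star, 0.1, EXAMPLE_EFFECTS))
```

Bisection is used here only because monotonicity makes it safe; any root finder for monotone continuous functions would serve equally well.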

6.6. Proof of Theorem 6.1. 

Proof. 

Let us define u^ = CBcr/y/n. First we want to show that u^ is bounded. 
Assume on the contrary that for some subsequence u^ — )■ oo. It holds for any 
constant i^ > 0, that 

j H^j{-uf - fi)/a)du + 1- j ^{^j{uf - fi)/a)du 
> (j.(_oo, -K) + u{K, oo))(l - <l>(V7(nf - K)/a)) . 



and obtain from (2.18) 



If u^ — 7- oo we can apply the tail approximation for the normal distribution 



But on the other hand the second assumption of (2.19) yields ( -^ I — > 
exp( — Co) which contradicts u^ — t- oo. 

If Uj := uj — 7- n < oo then the denominator of (2.18) converges to a constant 
Cu,u = 1 — v{—u,u). Under the first assumption of (2.19) equation (2.18) can 
only hold if \/jUj — )■ oo. Thus we can apply again the tail approximation to 
obtain 

a, /2 1 — a, / Co, ,, , 
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Taking logarithms and some simple calculations yield 



-=-".^~'l)-l„. 210, /)+ 10. f) +2,0.(1^ 



Now, the second condition in (2.19) implies that u = (T\/2Co, which completes 
the proof of (2.20). 

The critical value has exactly the same form as in the case of normal dis- 
tributions and the result on ABOS follows exactly the same way as in [8]. 
Define s„ := °^ (t/^\ — 1, then necessary and sufficient conditions for opti- 
mality are s^ — ^ and 2s„log(//n) — loglog(//a) — )■ — oo which immediately 
provides (2.21). From the first equation in (2.21) it follows that in case of ABOS 
Co = C/2, where C is the constant from Assumption (A). D 



6.7. Lemmas needed for Theorem 2.5. To prove optimality of the type II risk 
component of SD in the denser case we first show that with large probability the 
random threshold of SD is bounded from above by the asymptotically optimal 
threshold $c_1$. 

Lemma 6.5. Let $c_{SD}$ be the random SD threshold at the level $\alpha_n$ 
and let $c_1 = c_{1n}$ be the GW threshold (2.27) at the level $\alpha_{1n} = \alpha_n\xi_m$, where 
$\xi_m = (\log m)^{-s}$ with $s > 1$. Suppose that Assumptions (A) and (C), (2.28), 
(2.29) and (2.31) hold with $\alpha = \alpha_n$. Then $c_1$ is ABOS. Moreover, for every 
$\gamma_u > 0$ it holds for sufficiently large $m = m_n$ that 

(6.62)    $P(c_{SD} > c_1) \le m^{-\gamma_u}$ . 

Proof. Based on the second condition in (2.31) it is easy to show that ai„ 
satisfies the asymptotic optimality assumptions provided in Theorem . Thus, 
Theorem 2.4 immediately yields that ci is ABOS. 

To prove the second assertion of the Lemma we first note that by Lemma the 

of Cl, 



function H{c) := \_p/') is decreasing. Therefore according to the definition 



(6.63) {csd > Cl} = [h{csd) < am} • 

On the other hand the definition of csd actually gives 

2(1 - ^csp)) 

= On 

1 - Fmicsn) + 1/m- 
and thus 

Taking another intersection of the right hand side with {csd ^ ci} we can 
conclude that 



(6.64) P{csD > Cl) < P ( inf 1 Fm{c) + l/m ^ ^ 

\ c>ci 1 — I< (C) 



m 



imsart-imsgeneric ver. 2009/08/13 file: AB0S2_arxiv.tex date: July 13, 2011 



F. FROMMLET ET AL./ABOS FOR GENERAL DISTRIBUTION 34 

Using the standard transformation Ui = F{\Zi\) one obtains 

P{CSD > Cl) < P mf < S,rn 

yte[21m,l] 1 - t J 

where zim = F{ci), and Gmi't) is the empirical cdf of C/i, . . . , {7„j. Now, using 
the transformation u = \ — t and observing that Vi = 1 — Ui also has a uniform 
distribution we obtain 

D/ ^ ~ \ ^ r>( ■ t Gm{u) + 1/m , . \ 

P{CSD > Ci) < P mf < ^rn 

\«e[o,i-^im] u J 

This is equivalent to computing the probability that the empirical process 
Graiu) intersects the line L = —-^ +uEm within the interval [-4—, 1 — zim]- For 
this type of problem Proposition 9.1.1 of [37] can be applied. Define the event 

Bi = {Gm{u) intersects the line y = {u—a)/{bm) at height i/m but not below} 

Then 

P{Bi) = ("^^ a{a + iby~^{l -a- ib)""-' . 

In our case a = b = — ^- and thus 



m\ 1 / 1 + t\ I l + « 



and P{Bi) = for i > m^rn — 1- 

Now, similar to Lemma 10.3.1 of [37] (page 414) we can apply Stirling's 
formula, which for i < mS,m — 1 yields 

P(B) m! C^ + ' X ( 1 + ^'""' 

{i + iy.{m - i)l \m^rn J \ rn^m 

m™+i/2e-™\/2^exp(l/12m) (l + iVf l + i 

{i + l)^+3/2e-(i+i) V2¥(m - i - l)™-i+i/2e-(m-i)^/2^ [m^J \ m^„ 

exp(l/12m + 1) 1 : fm-{l + i)/^r, ^ "'"* 



exp(l/12m + 1) I ^ f Jl + ^ ^ 

V2^ (. + 1)3/2^1377^ ^-"^P^ \^U 

In the last step we adapted the inequality ( 1 — \^_i ) < e"*^'*'"^^ used by 
Shorack and Wellner in the proof of Lemma 10.3.1. In summary we find that 

P(i?,)<i^C'exp(-(i + l)/U). 

for some constant K which can be chosen such that it does not depend on m 
or i. As long as t— exp(— 1/^^) < 1 we then have 

P{csD > ci) < KY,a^M-{^ + 1)/U) = K _T^~V!t!. ^ • 
i=o ^ U ^^P'^ ^i^^'^> 
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Remembering that ^^ = (log?7i)~* with s > 1 finally yields (6.62). 

D 

The next lemma discusses the type II error component of the risk of SD. 

Lemma 6.6. Under the assumptions of Theorem 2.5 the type II error com- 
ponent of the risk of SD satisfies 

(6.65) RA<RB{l + Om) . 

Proof. For the extremely sparse case (2.14) we have seen already that the 
result follows by comparing with the Bonferroni rule which is ABOS according 
to Lemma 2.3. It remains to show the result for the denser case (2.30) and to 
note that both cases overlap. 

Denote by $L_A$ the number of false negatives under the SD rule and let $c_1$ be 
defined as in Lemma 6.5. Clearly 

$E(L_A) \le E(L_A \mid c_{SD} \le c_1)\,P(c_{SD} \le c_1) + m\,P(c_{SD} > c_1)$ , 

and furthermore 

$E(L_A \mid c_{SD} \le c_1)\,P(c_{SD} \le c_1) \le E L_1$ , 

where $L_1$ is the number of false negatives produced by the rule based on the 
threshold $c_1$. Since by Lemma 6.5 the rule based on $c_1$ is asymptotically optimal, 
it follows that $\delta_A E L_1 = R_{opt}(1 + o_m)$. On the other hand $P(c_{SD} > c_1) \le m^{-\gamma_u}$ 
for any $\gamma_u > 0$ if only $m$ is sufficiently large, and therefore 

$R_A = \delta_A E L_A \le R_B(1 + o_m) + \delta_A m^{1-\gamma_u}$ . 

Now by using assumptions (2.30) and (2.31), and choosing e.g. $\gamma_u = \gamma_2/2 + 1$, 
we conclude that $\delta_A m^{1-\gamma_u} = o(R_{opt})$, and the proof is thus complete. □ 

6.8. Lemma needed for Theorem 3.1. 



Lemma 6.7. Assume that (3.33), (3.34), (3.38) as well as assumptions (B) 
and (C) hold. Then the following bounds are valid for the type I and type II 
error rates of mBIC: 

(6.66)    $\frac{t_1}{p} = O\left((n\log n)^{-1/2}\right)$ ,  $t_2 = O\left(\sqrt{\frac{\log n}{n}}\right)$ . 

Proof. Let $b_{n,m} := \log n + 2\log m + d$. From the tail approximation of the 
standard normal distribution we immediately obtain 

$t_1 \le \frac{2}{\sqrt{2\pi b_{n,m}}}\,e^{-b_{n,m}/2} \le \frac{c}{m}\,(n\log n)^{-1/2}$ 
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for some constant c. Using the fact that mp — )■ s > from assumption (3.38) 
gives the first bound of (6.66). 

To bound type II error we proceed similarly as in the proof of Theorem 2.1. 
We have $t_2 = \int \Psi_n(\mu)\,d\nu(\mu)$ with 

$\Psi_n(\mu) = \Phi\left(\sqrt{b_{n,m}} - \frac{\sqrt{n}\mu}{\sigma}\right) - \Phi\left(-\sqrt{b_{n,m}} - \frac{\sqrt{n}\mu}{\sigma}\right)$ . 

The asymptotic behavior of this integral is obtained by an analysis similar to 
that leading to (6.56), resulting in 

(6.67)    $t_2 = \sigma\left(\rho(0^-) + \rho(0^+)\right)\sqrt{\frac{b_{n,m}}{n}}\,(1 + o_{n,m})$ , 

which completes the proof of Lemma 6.7 (since $m \le n$). □ 
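The normal tail approximation behind the type I error bound can be inspected directly. The sketch below assumes the two-sided rule that rejects when the squared standardized estimate exceeds $\log n + 2\log m + d$, and compares the exact type I error $2(1 - \Phi(\sqrt{x}))$ with the Mills-ratio bound $2\phi(\sqrt{x})/\sqrt{x}$ evaluated at that threshold. The function names and the numerical values of $n$, $m$ and $d$ are illustrative assumptions.

```python
import math

def threshold(n, m, d):
    """Squared critical value log n + 2 log m + d of the rule considered here."""
    return math.log(n) + 2.0 * math.log(m) + d

def type1_exact(n, m, d):
    """Exact two-sided type I error 2 * (1 - Phi(sqrt(b)))."""
    b = threshold(n, m, d)
    return math.erfc(math.sqrt(b / 2.0))  # equals 2 * (1 - Phi(sqrt(b)))

def type1_tail_bound(n, m, d):
    """Mills-ratio upper bound 2*phi(sqrt(b))/sqrt(b), written out explicitly.

    exp(-b/2) = exp(-d/2) / (m * sqrt(n)), so the bound is of order
    1 / (m * sqrt(n * b)).
    """
    b = threshold(n, m, d)
    return math.sqrt(2.0 / math.pi) * math.exp(-d / 2.0) / (m * math.sqrt(n * b))

if __name__ == "__main__":
    for n, m in ((10 ** 3, 10 ** 2), (10 ** 6, 10 ** 4)):
        exact = type1_exact(n, m, 0.0)
        bound = type1_tail_bound(n, m, 0.0)
        print(n, m, exact, bound, exact / bound)
```

The ratio of the exact value to the bound stays below one and moves towards one as the threshold grows, which is the behavior the asymptotic argument relies on.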



6.9. Proof of Theorem 3.2. 

Proof. In [1] it was shown that the step-up procedure BH corresponds to the 
smallest local minimum of the selection criterion (3.40), whereas SD corresponds 
to the largest local minimum of (3.40). Now mBIC1 is searching for the global 
minimum of (3.41), but we can again consider the smallest local minimum as 
well as the largest local minimum of (3.41). These will correspond to step-up 
and step-down procedures based on the comparison 



n/32 



Y- > log nm^ + d— 2 log(A;) — log \og{nm^ /k^) 



a 

Translating this comparison to the level of p-values when applying the usual 
tail approximation for the standard normal distribution yields 

Ak 2e^'^ 

(6.68) p[k] < — with A^ = -, 

^ ' m TTnz(k,m,n) 

where z[k,m,n) = 1 -\ — ~ "§ .°^^""! /^ — -. Since for sufficiently large n 

21oglogn t -, ^ n \ ^ t \ ^ loglog" 

1 = zi[n) < z{k,m,n) < Z2{n) = 1 — , 

log n 6 log n 

it holds that mBICl can be sandwiched between the step- up and step-down BH 
procedures with the FDR levels Oj = \/ -^^-r^, i = 1,2, correspondingly. Since 



Trnzi(n) ' 

both ai satisfy (2.21) the conditions of Theorem 2.5 are fulfilled and mBICl is 
itself ABOS. 

Similar considerations give the result for mBIC2, for which we obtain z{k, m, n) 
\og{nm?' /k"^ + d) in (6.68). Using the inequalities 

logn < z{k,m,n) < 31ogn 
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for n large enough to sandwich niBIC2 between step-up and step-down pro- 
cedures, ABOS of niBIC2 follows immediately from the fact that a oc , } 

•^ yjn log n 

fulfills (2.21). 

Finally for mBIC3 we get 

z{k, m, n) = 6^(1 - l/fc)2(^-i) (log(nmVfc^) + d + 2 + 2{k-l) log(l - l/k)) , 

and we find again logn < z{k,m,n) < Se^logn which yields ABOS as above. 

The consistency result is obtained the following way. From ABOS and the 
Markov inequality in (3.39) one easily concludes that all three criteria are con- 
sistent exactly when the Bayes oracle is consistent. Consider the asymptotic 
formulas concerning type II error (6.56) and type I error (6.58) for the special 
case $\delta = 1$. Then it immediately follows that the Bayes oracle is consistent 
under assumption (3.38). 

D 



6.10. Proof of Theorem 3.3. 

Recall that in our setting (two groups model, orthogonality) it is reasonable 
to think in terms of type I error (misclassification of a regressor under $H_0$) 
and type II error (misclassification of a true signal) for model selection proce- 
dures. To prove Theorem 3.3 we first bound the type I and the type II errors 
in Lemma 6.8 and Lemma 6.9 respectively. Both these results will be proved 
assuming minimal conditions under which the individual bounds on type I and 
type II errors hold. The conditions in Theorem 3.3 ensure that both lemmas 
hold and additionally that the overall upper bound on the total risk of mBIC 
is asymptotically equivalent to that of the Bayes Oracle. 

To bound the type I error we will make use of the following corollary given 
after Theorem 2 of Section 16.7, Vol.2 of [21]. 

Corollary 6.1. Let $F$ be the common distribution function of i.i.d. random 
variables $X_1, \ldots, X_n$ with $E(X_1) = 0$, $Var(X_1) = \sigma^2$ and let $F_n$ be the 
distribution function of the normalized sum $(X_1 + \cdots + X_n)/(\sqrt{n}\sigma)$. If $1 < x = o(\sqrt{n})$, then for any $\epsilon > 0$, for all sufficiently large $n$, 

(6.69)    $\exp\left(-(1 + \epsilon)x^2/2\right) < 1 - F_n(x) < \exp\left(-(1 - \epsilon)x^2/2\right)$ . 



Lemma 6.8. Assume $n \to \infty$, $m = m(n) \to \infty$ and that (3.33), (3.34) and 
(3.45) hold. Then the type I error probability of the decision rule based on mBIC 
criterion (3.44) is bounded by 

(6.70)    $t_1 \le \frac{C_1}{\sqrt{n}\,m}\,(1 + o_{n,m})$ , 

with $C_1 = \sqrt{\frac{2}{\pi}}\exp(-d/2)$. 
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Proof. Let 1 < i < m. Then the probabihty of type I error correspoding to 
$\beta_i$ is given by 

$t_{1i} = P(A_i \mid B_i)$, 

where $B_i$ denotes the event that $\beta_i = 0$ and $A_i$ denotes the event that the 
corresponding regressor is included in the model chosen by mBIC. Through 
exchangeability, it follows that $t_{1i} = t_1$ for each $1 \le i \le m$. Let us compare two 
models $M$ and $M \cup X_i$ where $X_i \notin M$. mBIC considers supplementing model 
$M$ with the variable $X_i$ only in the case 

(6.71)    $\log\frac{RSS_M}{RSS_M - n\hat\beta_i^2} > \frac{\log n + 2\log m + d}{n}$ 

where $RSS_M = Y'Y - \sum_{X_j \in M} n\hat\beta_j^2$ denotes the residual sum of squares of the model 
$M$ (we have used the orthogonality assumption (3.33) here). Henceforth we will 
use the abbreviations $Z_{i,M} := \log\frac{RSS_M}{RSS_M - n\hat\beta_i^2}$ and $u_{n,m} := \frac{\log n + 2\log m + d}{n}$. Note 
that if $M_1 \subset M_2$ then $RSS_{M_1} \ge RSS_{M_2}$. Therefore $RSS_{M_1}/(RSS_{M_1} - n\hat\beta_i^2) \le RSS_{M_2}/(RSS_{M_2} - n\hat\beta_i^2)$ and the event that a given false positive is added to the 
model $M_1$ is contained in the event that it is added to the model $M_2$. According 
to these considerations we obtain an upper bound for the type I error 

$t_1 \le P\left(\bigcup_{M \in \Omega_L} \{Z_{i,M} > u_{n,m}\} \,\Big|\, B_i\right)$, 

where $\Omega_L$ is the set of all models $M$ with $L - 1$ regressors in addition to the 
common intercept term, such that $X_i \notin M$. We bound the probability of the 
right hand side above in three intermediate steps. 
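The inclusion rule (6.71) and the monotonicity of $Z_{i,M}$ in the residual sum of squares can be written out in a few lines. This is a minimal sketch under the orthogonality assumption, in which adding a regressor with estimated coefficient $\hat\beta_i$ reduces the residual sum of squares by $n\hat\beta_i^2$; the function names and the numbers in the example are hypothetical.

```python
import math

def penalty(n, m, d):
    """u_{n,m} = (log n + 2 log m + d) / n, the right hand side of (6.71)."""
    return (math.log(n) + 2.0 * math.log(m) + d) / n

def z_statistic(rss_m, n, beta_hat_i):
    """Z_{i,M} = log(RSS_M / (RSS_M - n * beta_hat_i^2)) under orthogonality."""
    gain = n * beta_hat_i ** 2  # reduction of RSS when X_i is added to M
    if not 0.0 <= gain < rss_m:
        raise ValueError("n * beta_hat_i^2 must be smaller than RSS_M")
    return math.log(rss_m / (rss_m - gain))

def adds_regressor(rss_m, n, beta_hat_i, m, d):
    """True if the rule (6.71) favors supplementing model M with X_i."""
    return z_statistic(rss_m, n, beta_hat_i) > penalty(n, m, d)

if __name__ == "__main__":
    n, m, d = 100, 50, 0.0
    print(adds_regressor(100.0, n, 0.5, m, d))   # large estimated effect
    print(adds_regressor(100.0, n, 0.1, m, d))   # small estimated effect
    # A smaller model has larger RSS and hence a smaller Z for the same X_i.
    print(z_statistic(120.0, n, 0.5), z_statistic(100.0, n, 0.5))
```

The last line reflects the nesting argument in the text: for $M_1 \subset M_2$ the statistic computed in $M_1$ never exceeds the one computed in $M_2$.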
Step 1: Let T, = J and e„,^ = i°ga°g»+2iogn^) _ Then 



and therefore 

P{ (J {Zi^M > Un,n,}\Bi) < 

MeUL 

-P( [j {Zi^M — Ti > en,m}\Bi) + P{Ti> Un,m — en,m\Bi) . 

The second term on the right hand side of the above inequality can be expressed 
as 

/n/32 
P{Ti > Un,m-en,m\Bi) = P -^ > log n + 2 log m + d - log(log n + 2 log m)|Bi 

and from the normal tail approximation we obtain 

\/nm 
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where Ci = \ - exp(— (i/2). Now to establish inequahty (6.70) it remains to be 
shown that ^(Ujuen {^i,M — Ti > €n,m}\Bi) is of lower order. 

Step 2: Let 5„ = -. Similar arguments as in Step 1 yield 

{Zi^M-Ti > en,m} C {Zi^M > - log(l-(Ti+(5„))}U{-Ti-log(l-(Ti+(5„)) > £„,„} . 

The first set on the right hand side can be rewritten as { j^gg > Ti + 5n] and 
therefore 



P{ \J {Zi^M — Ti> en,rn}\Bi 



52 



To bound the second term note that — log(l — x) < a; + 2%^ for < a; < 1/2. 
Hence 

P(-T,-log(l-(r,+,5„)|) > en,ra\Bi) < P ({T, + 5nf > '"'"'^~ '^" |-B, Vp(r,+,5„ > 1/2|S, 
First note that 



and for sufficiently large n the normal tail approximation yields 

P\{Ti + 5n) > \Bij<exp\ — — 

Second we have 

P(.T, + 5n> l/2\Bi) = P{nTi > n/2 - l\Bi) < ^ ^ exp(-(n/4 - 1/2)). 

Combining the two bounds obtained above, it follows that 

P{-T, - log(l - (T, + Sn))> en,m\Bi) = o 

since m < n. 

Step 3: We will now bound the remaining term 



nm 



^1 U liJ::>^. + i|l« 






1 RSSm 
Observing that 



'!f rjs^-iUi cM>c„Jui "" 
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where we choose Cn^m = logn + 21ogm, we conclude that 

^f U |7#>r. + i||B.]<Ff4£<=»,™ls.VE^(7^-i£— IS.) 

By the tail approximation of the standard normal distribution we obtain that 
for sufficiently large n 

-* I 9 ^ '^n,m\-t3i I — / — Onm ■ 
V CJ^ / Jnm 



We now have to bound the remaining series. Fix any model M G VLl and assume 
that k true signals are not included in M. Under assumptions (3.33) and (3.34) 

it immediately follows that RSSm = Wk + Z^, where Z^ = n ^ /3? refers 

r=l 

to the k true signals which were not detected, and Wk ^ cr'^xfn~L~k) ^^ ^^^ 
remainder term. Zk and Wk are independent and Zk is stochastically larger 
than a cr'^x't distributed random variable. Therefore RSSm is stochastically 



larger than a'^xf __rv But this argument holds for any k, and we conclude that 



Now 



/ na"^ 1 , \ I n 1 

\HSSm Cn,m J \Xn-L ^".'^ 



n ^-^ 1 ] _p[ xLl-("--^) ^ ^"T+t;; 



\xi-L Cn,mJ \ y/2{n - L) y/2{n-L)J 

will be bounded using a normal tail approximation argument. /,From assumption 
(3.45) it follows that L = o{n/cn,m.) and therefore 

{^ I Ofim) ■ 



Y^2(n - L) \/2(log n + 2 log 



m) 



Applying Corollary 6.1 with x = ^(i^g^^iogm) " o(Vn - L) yields that for 
every e > and n large enough (dependent on e) it holds 

F : — < — : < exp 



v/2(n - L) ^2{n-L)) \ 2(logn + 21ogm; 

The number of models with L — 1 regressors is (^"!^) < rn . Thus, for suffi- 
ciently large n 



.sn 



/^ "^^ 1^1 ^^_L.™/^ (l-e)n \_^f 1 



1 > < fW exp —77; TTT = o 



^^^ ^RSSm Cn,mJ V 4(logn + 21ogm)27 \^/nm 

which finishes the proof. D 

Next we compute a bound for the type II error: 
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Lemma 6.9. Assume n — > oo, m = m{n) — > oo and that (3.33), (3.34), 
(3.45), (3.46), (3.38) and Assumption (C) hold. Then the type II error of the 
decision rule based on mBIC criterion (3.44) is bounded by 

(6.72)    $t_2 \le \sigma\left(\rho(0^-) + \rho(0^+)\right)\frac{\sqrt{\log n + 2\log m}}{\sqrt{n}}\,(1 + o_{n,m})$ 



Proof. Let $1 \le i \le m$ and suppose $B_i$ is the event that $\beta_i \neq 0$ and let $A_i$ 
be the event that the corresponding regressor is not detected. Then we have 
type II error $t_{2i} = P(A_i \mid B_i)$, and by exchangeability it follows that $t_{2i} = t_2$ is 
independent of $i$, for each $1 \le i \le m$. 

Let us introduce the symbol D to denote the event that none of the Xj^s 
corresponding to the null hypothesis are included in the model chosen by 
mBIC. Similar to the proof of Lemma 6.8 one can show that for every i 7^ j, 
P{A,\B„B,) = O (^). Then P{D-m = O(^) and thus 

t2 = PiAinD\Bi) + 0{n-^^^) . 

To shorten the notation we now define Af = AiCi D. 
Note that 

m 

t2 = Y,P(^F\K = k,B,)P{K = k\B,) , 

k=l 

where K is the number of nonzero /3's among /3i,.../3m. Under assumption 
(3.34), given Bi, K — 1 has a binomial distribution B{m — l,pm). Define L' = 
[?n,pm(logn)^+''J , where [z\ denotes the largest integer less than or equal to z. 
Using the assumption (3.38) and Bennett's inequality, it is easy to show that 

(6.73) P{K > L'\Bi) = o{n-^/'^). 
Thus 

L' 

(6.74) t2<Y, Pi^f\K = ^' B,)P{K = k\Bi) + 0{n-^/^) . 

k=l 

Note that here we made use of assumption (3.46). 

Given K = k, let the ordered values of the squares of the estimates of the 
regression coefficients corresponding to the true regressors among Xi , . . . , X^ 
be denoted by f3^^^ < /3^^-^ < ... < fSf^y Clearly 

fc 
P{Af\K = k,Bi) = J^P(if lA = ^(r),K = k,Bi)P0i = /3(,)|ir = k,Bi) , 

r=l 

and using the fact that /3j's corresponding to true signals are i.i.d. continuous 
random variables, the above equation can be rewritten as 

1 '' 
PiA^\K = k,R^ = -Y,Pi^-)\K = k) , 

r=l 
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where AP^ is the generic event that neither the regressor corresponding to (3tr) 

false positives are detected 
nested, i.e, AP -^^^ C AP^, and thus 

(6.75) i J2 P(^(r)\K = fc) = E 'k^(^fr) n (if.+l))1i^ = k) 

r=l r=l 

where we define {AP,,-^^^\K = /c} = 0. Thus we can write 



nor any false positives are detected by mBIC. Note that the events AP-.^s are 

r=l r=l 



L' k 

k' 



*^^T.I1 jnAfr) n (ig+i))1i^ = k)P{K = km + 0(n-V2) 



k=l r=l 

The event AP. n (^^ij^O'^ implies that the model chosen by mBIC includes the 
(k — r) true regressors having the largest absolute values of estimated regression 
coefficients, denoted by Xt.f.^i\, . . . ,Xti^\, while Xrj\ for 1 < j < r are not 
included. This event also corresponds to the situation when no false positives 
are included. Hence the model includes k — r < L' < L regressors, and in any 
case we have not yet exhausted our maximum model size. So, since Xr^\ is not 
included in the model we can infer that mBIC criterion is getting larger by 
adding Xr^y Denoting by RSS^-r the residual sum of squares of the model 
including Xt,^._^i\, . . . , ^(fc) (or only the intercept in case r = k), we have 

P(ig) n (ig+i))1i^ = k)<P flog i ^J^^'-'^, ] < Un,m\K = k] . 

Since for x £ (0, 1), log(l/(l — x)) > x, 

(6.76) p^Al,^^{A^^^^^r\K = k)<p\—l^^<UnA^ = k\ . 

Under K = k, RSS^^r is the sum of two independent random variables, the 
first being a a'^Xn-k-i random variable {Xn-k-i being a central chi-square with 

r 

{n — k — 1) degrees of freedom), while the second is Yl ^/5?v Therefore 



\ 



/ n/32 \ 



n/3f,) 



< u 



'n,m 



^ E n/32 



^'xf^-k-D + E «/5(,) 



and because of /3? -, > /3?.s, for r > j one obtains 

P ( j^gj < Un,m\K = k\ < P{n(3f^^ < Cr^xLfe-l^n,m(l + On,m)) , 
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where {l + On,m) = i^ru ■ (^ote that r < k < L' = mpmilogn)^'^^ , and under 

assumption 3.45 run,m — )■ as n — )■ cx)). 
Define 6„=i2|(l^. Then 



n — k — 1 + 



\/2(n -k- if-^A /n = 1 + o„ 



and therefore 

?2 ^ 2, ,2 



+P ixl-k^i >n-k-l + V2{n -k- l)!-^-^ 



A simple appHcation of Chebyshev's inequahty yields 

P (xLfe-i >n-k-l + \/2(n -k- if-^A <{n-k- 1)-^+"^^^ 



Now, observe that 



V k 



ZZ^P (xtfe-i >n-k-l + V2in-k- 1)1-^") PiK = k\B,) 

k=lr=l 

<{n-k- l)-i+26„^((_^ ^ i)/2) < mp (n - fc - l)-i+2bn = o{n~^/^) , 
where the last equality follows after some calculations from (3.38). Therefore 

L' k 



k=lr=l 

Let us define 



(6.77) t2<Y,J2k ^^^fr) < ^'^n,m(l + o„,„)) P{K = k\B.i) + 0{n-^ 



-1/2N 



(6.78) q = qn,m ■= P i/^f < (T^Un,m{l + 0„,„ 

Given that /Si ^ v * M{0,a'^/n) straight forward computations yield 



9 ^nu,„,m(l + On,m) ~ ^ -^nUn,m{'^ + On,m) 



a 



a 



dv{jx). 



fi&IR 



The asymptotic behavior of this integral is obtained by similar analysis like 
that leading to (6.56), resulting in 



O-(p(0 )+p(0^)) j= (l + On,m)- 



n 



Define the contribution for fixed r in the sum on the right hand side of (6.77) 



as 



*r := Y. T^^^fr) < <^^Un,„,{l + 0„,„)) P{K = k\Bi). 



k=r 
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For r = 1 we observe that 

$P\left(\hat\beta_{(1)}^2 < \sigma^2 u_{n,m}(1 + o_{n,m})\right) = 1 - (1 - q)^k = kq - \sum_{j=2}^{k} \binom{k}{j}(-q)^j$ , 

with the convention that $\binom{1}{2} = 0$. 
Thus 



L' k 

fc=2 i=2 ^-^^ 



j=2 •' fc=i 
L' j 

- ' + §70^'""'"' 

< q + q^mpe'^P'^ = q{l + o{q)) . 

as long as $mpq \to 0$, which is guaranteed by (3.38). The first inequality follows 
from the fact that under $B_i$, $K - 1 \sim Bin(m - 1, p)$ and thus 

(6.79)    $E\left((K - 1)(K - 2)\cdots(K - j + 1)\right) = (m - 1)(m - 2)\cdots(m - j + 1)\,p^{j-1} \le (mp)^{j-1}$ . 
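Identity (6.79) is the standard factorial moment formula for the binomial distribution applied to $K - 1$. It can be verified exactly with rational arithmetic; the sketch below uses illustrative values of $m$, $p$ and $j$, and the helper names are hypothetical.

```python
from fractions import Fraction
from math import comb

def falling(x, k):
    """Falling factorial x * (x - 1) * ... * (x - k + 1)."""
    result = 1
    for i in range(k):
        result *= x - i
    return result

def binomial_factorial_moment(size, p, k):
    """E[Y (Y - 1) ... (Y - k + 1)] for Y ~ Bin(size, p), computed exactly."""
    return sum(
        comb(size, y) * p ** y * (1 - p) ** (size - y) * falling(y, k)
        for y in range(size + 1)
    )

def lhs_679(m, p, j):
    """E[(K-1)(K-2)...(K-j+1)] with K - 1 ~ Bin(m - 1, p)."""
    return binomial_factorial_moment(m - 1, p, j - 1)

def rhs_679(m, p, j):
    """(m-1)(m-2)...(m-j+1) * p^(j-1)."""
    return falling(m - 1, j - 1) * p ** (j - 1)

if __name__ == "__main__":
    m, p, j = 12, Fraction(1, 7), 4
    print(lhs_679(m, p, j), rhs_679(m, p, j), (m * p) ** (j - 1))
```

Using `Fraction` keeps the comparison exact, so equality of the two sides is checked without floating point tolerance.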

Finally we have to bound the contribution ^r in the sum on the the right 
hand side of (6.77) stemming from r > 1. Note that 






V 7/ 
Similar computations as above using (6.79) yield 



^^ = E ^ E (') '^^■(1 - '^)'"''^(^ = ^1^^) ^ E (j^^'"' - ^'^'"'^ 



Summing over all possible values of r > 1 finally gives 

L' 2 

^—( 1 — as 



qs 

r=2 ^ 



Thus we have shown that 



L' 
t2 < E ^^ + 0{n-^/^) < g(l + On,m), 

r=l 

since 0{n^^''^) = o{q). This completes the proof of the lemma. D 
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Proof of Theorem 3.3 

Proof. First note that the assumption on $mp$ in (3.36) is stronger than 
that in assumption (3.38). Given (3.36) it is easy to see that the type II error 
estimate in Lemma 6.9 is asymptotically of the same form as the type II error 
of the Bayes oracle for $C = 0$ in (6.56). To show ABOS it is therefore sufficient 
that the risk component of the type I error is of smaller order than the Bayes 
risk. From Lemma 6.8 we conclude 

"^ o 



Rbo \ mp^/iog 

Under Assumption (B) $\delta$ is bounded from above and ABOS follows. 

Consistency follows exactly the same way as in Theorem 3.1. D 
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Fig 3. Simulation runs for known a. Misclassification probability (MP), False Discovery Rate 
(FDR) and Power for different selection rules and sparsity parameter p at values of p ∈ 
{0.001, 0.005, 0.01, 0.02, 0.05, 0.1, 0.2}. 
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Fig 4. Simulation runs for unknown a . Misclassification probability (MP), False Discovery 
Rate (FDR) and Power for different selection rules and sparsity parameter p at values of 
p ∈ {0.001, 0.005, 0.01, 0.02, 0.05, 0.1, 0.2}. 
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